This work builds upon the existing open source, user-space DMTCP package for transparent, distributed checkpointing. Three goals will be accomplished: (i) checkpoint-restart of long-running computations on the desktop; (ii) save-restore of interactive software packages; and (iii) a universal reversible debugger. The first two goals will allow software development teams to add to their package a reliable "save workspace" feature, with no requirement for a kernel module or other privileged operations. The third goal is to enhance any debugger with reversibility (e.g., a back-step command), and with a reverse expression watchpoint command to move backwards from a software error to the original software fault.

INTELLECTUAL MERIT: While checkpointing has existed for over 20 years, earlier packages were difficult to maintain. The unprivileged, user-space design of DMTCP has a five-year track record. It is ideal for integration into other software, where any end-user requirement for installation of a kernel module or other administrative privilege is incompatible with widespread distribution. Finally, DMTCP is the first package able to directly checkpoint a gdb session (the gdb process and its target process), a key feature for the envisioned new type of reversible debugger.

BROADER IMPACT: Checkpointing and process migration have long been of interest for science and engineering, but too often suffered from software fragility or special requirements. The DMTCP approach removes these obstacles. Further, the wider use of "time-traveling (reversible) debuggers" will greatly accelerate software development due to the greater ease of finding bugs. A NIST report estimates the cost of software bugs to the economy at $59.5 billion per year. Finally, the excitement factor of checkpoint-restart on the desktop helps attract and motivate students toward the learning of sometimes arcane systems issues in this critical technology.

Project Report

Checkpoint-restart allows one to save a running computer program or computation to disk, and later to restart it from where it stopped. This is important for long-running programs or long interactive sessions. A checkpoint-restart package, DMTCP (Distributed MultiThreaded CheckPointing), has been designed around the idea of plugins for process virtualization. Plugins allow it to adapt when the execution environment at restart differs from the one at checkpoint time. This new platform was the key to creating modest-sized, easily maintainable support for the several results described below.

The checkpoint-restart package is distributed as free and open-source software at http://dmtcp.sourceforge.net . It is also available as a package for the most common Linux distributions. The DMTCP forum provides free technical support.

This platform enabled three especially significant results (firsts for checkpointing), of which two are published and one represents ongoing work:

(i) Checkpoint-restart has been extended to directly support checkpointing over the InfiniBand network. This eliminates the need for an MPI-specific checkpoint-restart service to "tear down the network", delegate to a single-host checkpointing package, and then "re-build the network".

(ii) Checkpoint-restart has been extended to checkpointing of a network of virtual machines (QEMU over KVM, while checkpointing the Tun/Tap bridge network). This has important applications for Cloud Computing.

(iii) Checkpoint-restart has been extended to provide support for checkpointing the state of programmable GPUs (using modern shaders). The mechanism demonstrated is record-prune-replay.

In particular, the support for InfiniBand, in conjunction with a principled implementation of checkpoint-restart for the ssh protocol, allows DMTCP to transparently checkpoint various dialects of MPI.
This technology has the potential to transparently checkpoint and restart entire batch queues for high performance computing (HPC). Such a capability would make more efficient use of expensive computer resources in HPC. In addition, a reversible debugger, FReD, has been demonstrated as a proof of principle. It supports a novel debugging technique, "reverse expression watchpoint" (also called "reverse transition watchpoint"). A single mechanism for supporting GDB, the Python debugger, and the Matlab debugger has been demonstrated, and it can be easily extended to additional debuggers.
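
The idea behind a reverse expression watchpoint can be illustrated with a small toy model. The sketch below is not FReD's implementation; it is a minimal Python illustration under stated assumptions: the program is deterministic, a snapshot of its state is kept after every step (a real tool would instead restart from a process checkpoint and re-execute), and the watched expression holds at the start and fails at the end. The names run_with_checkpoints and reverse_watch are hypothetical. Given those assumptions, a binary search over the snapshots locates a step at which the expression changes from true to false.

```python
import copy


def run_with_checkpoints(initial_state, steps):
    """Run each step in order and snapshot the state before the first step
    and after every step. (Assumption: states are small enough to copy;
    a real checkpointing tool would save process images instead.)"""
    state = copy.deepcopy(initial_state)
    checkpoints = [copy.deepcopy(state)]
    for step in steps:
        step(state)
        checkpoints.append(copy.deepcopy(state))
    return checkpoints


def reverse_watch(checkpoints, expr):
    """Binary search for a step k such that expr holds after step k-1
    and fails after step k. Requires expr to hold in the first snapshot
    and fail in the last one."""
    lo, hi = 0, len(checkpoints) - 1
    if not expr(checkpoints[lo]) or expr(checkpoints[hi]):
        raise ValueError("expr must hold at the start and fail at the end")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if expr(checkpoints[mid]):
            lo = mid  # still good here: the transition is later
        else:
            hi = mid  # already bad here: the transition is at or before mid
    return hi


# Toy program: a list that is supposed to stay sorted.
# The fourth step appends an out-of-order value (the "fault").
values = [1, 3, 5, 2, 8, 9]
steps = [lambda s, v=v: s["items"].append(v) for v in values]

checkpoints = run_with_checkpoints({"items": []}, steps)


def is_sorted(state):
    return state["items"] == sorted(state["items"])


faulty_step = reverse_watch(checkpoints, is_sorted)
print(faulty_step)  # prints 4
```

The search inspects only a logarithmic number of snapshots, which is why moving backwards from an observed error to the step that introduced it can be practical even for long runs. If the expression flips more than once, this sketch finds one true-to-false transition, not necessarily the earliest.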

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 0960978
Program Officer: Kevin L. Thompson
Project Start:
Project End:
Budget Start: 2010-05-15
Budget End: 2014-04-30
Support Year:
Fiscal Year: 2009
Total Cost: $376,870
Indirect Cost:
Name: Northeastern University
Department:
Type:
DUNS #:
City: Boston
State: MA
Country: United States
Zip Code: 02115