The project will design and implement a pre-compiler and runtime system that will enable an existing parallel C or Fortran program, which uses MPI or OpenMP library calls for parallelism, to checkpoint and restore its computational state. Unlike the checkpoints taken by a system like Condor, the checkpoints produced by this system will be portable in the sense that they can be used to restart the application on another machine of a different processor type and configuration. The project's goal is to accomplish this with minimal impact on the performance and scalability of the application. The project will develop [...], whereas a complete system-level checkpoint for the same application on a large machine will need to save terabytes of data to disk. In a grid environment, the checkpoints taken by system-level checkpointing are not portable, so computations cannot migrate to take advantage of changing resource availability.

At present, programmers who wish to use application-level checkpointing must analyze and instrument their code manually. Our proposed system will automate this, requiring only minimal input from the programmer. Accomplishing this requires new checkpoint protocols for process coordination, some of which we have already developed and implemented. It also requires inter-procedural program analysis techniques and a sophisticated runtime system, which we will implement in this project. The approach we are taking requires expertise in high-performance computing, program analysis, runtime system design, and distributed computing. We believe our project team can rise to the challenge.
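
To make the idea of manual instrumentation concrete, the following is a minimal single-process sketch of application-level checkpointing. It is not the project's pre-compiler or runtime system, and it does not use MPI or OpenMP. It is written in Python for brevity, and the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `stop_after` parameter used to simulate an interruption are all assumptions made for illustration. The sketch saves only the state the computation needs to resume (a loop counter and a running total), and it writes that state as text so that the file does not depend on the byte order or word size of the machine that produced it.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the minimal application state as JSON text (hypothetical format).

    The data is written to a temporary file first and then renamed, so an
    interruption during the write cannot leave a half-written checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Sum the squares of 0..total_steps-1, resuming from a checkpoint if present.

    stop_after (hypothetical) simulates a failure: the function returns None
    after executing that many iterations in this invocation.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None  # simulated interruption; work since last checkpoint is lost
        state["total"] += state["step"] ** 2
        state["step"] += 1
        executed += 1
        # Manually inserted checkpoint location: only the loop state is saved.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

In this sketch the programmer decides by hand which variables make up the state and where checkpoints are taken. For a parallel program the same decisions must also be coordinated across processes, which is the part that requires checkpoint protocols and program analysis.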

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0406345
Program Officer: Frederica Darema
Budget Start: 2004-08-15
Budget End: 2007-10-31
Fiscal Year: 2004
Total Cost: $600,000
Name: Cornell University
City: Ithaca
State: NY
Country: United States
Zip Code: 14850