The project will design and implement a pre-compiler and runtime system that will enable an existing parallel C or Fortran program, which uses MPI or OpenMP library calls for parallelism, to checkpoint and restore its computational state. Unlike the checkpoints taken by a system like Condor, the checkpoints produced by this system will be portable in the sense that they can be used to restart the application on another machine of a different processor type and configuration. The project's goal is to accomplish this with minimal impact on the performance and scalability of the application. The project will develop application-level checkpointing, which saves far less data, whereas a complete system-level checkpoint for the same application on a large machine will need to save terabytes of data to disk. In a grid environment, the checkpoints taken by system-level checkpointing are not portable, so computations cannot migrate to take advantage of changing resource availability.

At present, programmers who wish to use application-level checkpointing must analyze and instrument their code manually. Our proposed system will automate this, requiring only minimal input from the programmer. Accomplishing this requires new checkpoint protocols for process coordination, some of which we have already developed and implemented. It also requires inter-procedural program analysis techniques and a sophisticated runtime system, which we will implement in this project. The approach we are taking requires expertise in high-performance computing, program analysis, runtime system design, and distributed computing. We believe our project team can rise to the challenge.
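
To make the manual instrumentation mentioned above concrete, here is a minimal single-process sketch in C of what a programmer might write by hand. It is an illustration only, not the project's pre-compiler or runtime system, and it leaves out the process coordination that MPI or OpenMP programs need. The names `AppState`, `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical. The state is written as text, on the assumption that a text file can be read back on a machine with a different byte order or word size.

```c
#include <stdio.h>

/* Hypothetical application state: only what is needed to resume. */
typedef struct {
    long next_iter;  /* first iteration that has not run yet */
    double sum;      /* running result of the computation */
} AppState;

/* Write the state as text so the file does not depend on the
   byte order or word size of the machine that produced it. */
int save_checkpoint(const char *path, const AppState *s) {
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    int ok = fprintf(f, "%ld %.17g\n", s->next_iter, s->sum) > 0;
    if (fclose(f) != 0) ok = 0;
    return ok ? 0 : -1;
}

/* Read the state back; returns 0 on success, -1 otherwise. */
int load_checkpoint(const char *path, AppState *s) {
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    int n = fscanf(f, "%ld %lf", &s->next_iter, &s->sum);
    fclose(f);
    return n == 2 ? 0 : -1;
}

/* Run iterations [s->next_iter, stop) and checkpoint after every
   `interval` completed iterations. The loop body is a stand-in
   for real work. */
void run(AppState *s, long stop, long interval, const char *path) {
    while (s->next_iter < stop) {
        s->sum += (double)s->next_iter;
        s->next_iter++;
        if (s->next_iter % interval == 0)
            save_checkpoint(path, s);
    }
}
```

On startup, a program built this way would call `load_checkpoint` first and fall back to a fresh `AppState` if no checkpoint file exists, then call `run` to continue from `next_iter`. Deciding which variables belong in the saved state, and where it is safe to save them, is the analysis the programmer currently has to do by hand.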

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0739601
Program Officer: Anita J. LaSalle
Project Start:
Project End:
Budget Start: 2007-01-01
Budget End: 2008-12-31
Support Year:
Fiscal Year: 2007
Total Cost: $485,818
Indirect Cost:

Name: University of Texas Austin
Department:
Type:
DUNS #:
City: Austin
State: TX
Country: United States
Zip Code: 78712