As the number of cores in high performance computing (HPC) systems continues to grow, the mean-time-to-failure (MTTF) for large HPC systems is becoming shorter than the execution time of many HPC applications. Fault tolerance is becoming one of the critical techniques for the effective use of large HPC systems.

This project develops highly efficient algorithmic fault tolerance techniques for selected linear algebra computations to tolerate both fail-stop and fail-continue failures. Fail-stop failures, where the failed computation crashes, are often tolerated by checkpoint. This project removes checkpoint from fault tolerance for selected linear algebra computations so that neither checkpoint nor rollback is necessary for the protection of these computations. Fail-continue failures, where the corrupted computation continues to make progress but the computation results cannot be trusted any more, are usually tolerated offline by checking the computation results after the computation finishes. This project designs novel online fault tolerance techniques to detect fail-continue failures in the middle of the computation so that better efficiency can be achieved by stopping the corrupted computations in the middle of the computation in a timely manner.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
1305622
Program Officer
Almadena Chtchelkanova
Project Start
Project End
Budget Start
2012-09-01
Budget End
2016-07-31
Support Year
Fiscal Year
2013
Total Cost
$340,913
Indirect Cost
Name
University of California Riverside
Department
Type
DUNS #
City
Riverside
State
CA
Country
United States
Zip Code
92521