As the number of cores in high performance computing (HPC) systems continues to grow, the mean-time-to-failure (MTTF) for large HPC systems is becoming shorter than the execution time of many HPC applications. Fault tolerance is becoming one of the critical techniques for the effective use of large HPC systems.
This project develops highly efficient algorithmic fault tolerance techniques for selected linear algebra computations to tolerate both fail-stop and fail-continue failures. Fail-stop failures, where the failed computation crashes, are often tolerated by checkpoint. This project removes checkpoint from fault tolerance for selected linear algebra computations so that neither checkpoint nor rollback is necessary for the protection of these computations. Fail-continue failures, where the corrupted computation continues to make progress but the computation results cannot be trusted any more, are usually tolerated offline by checking the computation results after the computation finishes. This project designs novel online fault tolerance techniques to detect fail-continue failures in the middle of the computation so that better efficiency can be achieved by stopping the corrupted computations in the middle of the computation in a timely manner.