SHF: Small: FTLA: Fault Tolerant Linear Algebra Software for Massively Parallel Architectures

Chen, Zizhong

Abstract

As the number of cores in high performance computing (HPC) systems continues to grow, the mean time to failure (MTTF) of large HPC systems is becoming shorter than the execution time of many HPC applications. Fault tolerance is therefore becoming a critical technique for the effective use of these systems.

This project develops highly efficient algorithmic fault tolerance techniques for selected linear algebra computations that tolerate both fail-stop and fail-continue failures. Fail-stop failures, in which the failed computation crashes, are often tolerated by checkpointing. This project eliminates checkpointing for selected linear algebra computations, so that neither checkpoint nor rollback is needed to protect them. Fail-continue failures, in which the corrupted computation keeps making progress but its results can no longer be trusted, are usually handled offline by checking the results after the computation finishes. This project designs novel online fault tolerance techniques that detect fail-continue failures while the computation is still running, so that a corrupted computation can be stopped promptly and efficiency improved.
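
The abstract does not say which techniques the project uses. As an illustration only, the sketch below shows the classic checksum encoding for matrix multiplication, a well-known form of algorithm-based fault tolerance: one input carries an extra row of column sums, the other an extra column of row sums, and their product then carries checksums that expose a corrupted entry. All function names are hypothetical and chosen for this sketch; it is not presented as the project's method.

```python
# Illustrative checksum-based fault detection for matrix multiplication.
# Function names are hypothetical; this is a sketch, not the project's code.

def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]

def checksums_consistent(c_full, tol=1e-9):
    """Return True if the last column and last row of c_full match the
    row sums and column sums of the data block they protect."""
    data = [row[:-1] for row in c_full[:-1]]
    for i, row in enumerate(data):
        if abs(sum(row) - c_full[i][-1]) > tol:
            return False
    for j, col in enumerate(zip(*data)):
        if abs(sum(col) - c_full[-1][j]) > tol:
            return False
    return True

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# The product of the encoded inputs carries its own checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
ok_before = checksums_consistent(c_full)

# Simulate a fail-continue error: one entry is silently corrupted.
c_full[0][1] += 1.0
ok_after = checksums_consistent(c_full)

print(ok_before, ok_after)
```

In this sketch the check runs once, after the product is complete. If the product is instead accumulated as a sum of outer products, each partial sum satisfies the same checksum relation, so the check could also run between steps. That is one way detection can happen during the computation rather than only at the end.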

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1305622
Program Officer: Almadena Chtchelkanova

Budget Start: 2012-09-01
Budget End: 2016-07-31
Fiscal Year: 2013
Total Cost: $340,913

Institution: University of California Riverside, Riverside, CA, United States
