In High-End Computing (HEC), faults have become the norm rather than the exception for parallel computation on clusters with tens or hundreds of thousands of cores. As the core count increases, so does the overhead of fault-tolerant techniques relying on checkpoint/restart (C/R) mechanisms. Once overheads reach 50%, redundancy becomes a viable alternative to fault recovery, and it actually scales, which makes the approach attractive for HEC.

The objective of this work is to develop a synergistic approach that combines C/R-based fault tolerance with redundancy in HEC installations to achieve high levels of resilience.

This work alleviates the scalability limitations of current fault-tolerant practices. It contributes to fault modeling as well as fault detection and recovery, significantly advancing existing techniques by controlling the level of redundancy and the checkpointing interval in the presence of faults. It is transformative in providing a model in which users select a target failure probability at the price of using additional resources.

Project Report

In High-End Computing (HEC), faults have become the norm rather than the exception for parallel computation on clusters with tens or hundreds of thousands of cores. As the core count increases, so does the overhead of fault-tolerant techniques relying on checkpoint/restart (C/R) mechanisms. Once overheads reach 50%, redundancy becomes a viable alternative to fault recovery, and it actually scales, which makes the approach attractive for HEC. The objective of this work is to develop a synergistic approach that combines C/R-based fault tolerance with redundancy in HEC installations to achieve high levels of resilience. This work alleviates the scalability limitations of current fault-tolerant practices. It contributes to fault modeling as well as fault detection and recovery, significantly advancing existing techniques by controlling the level of redundancy and the checkpointing interval in the presence of faults. It is transformative in providing a model in which users select a target failure probability at the price of using additional resources.

Our work shows that redundancy-based fault tolerance can be used in synergy with checkpoint/restart-based fault tolerance to achieve better application performance for large-scale HPC applications than either technique can achieve alone; this result has been analytically modeled and experimentally confirmed.

We further assessed the feasibility and effectiveness of silent data corruption (SDC) detection and correction at the MPI layer via redundancy. We developed two consistency protocols, explored the unique challenges of creating a deterministic MPI environment for replication purposes, investigated the effects of fault injection into our framework, analyzed the costs, and demonstrated the benefits of SDC protection via redundancy.

We also studied Single Event Upsets (SEUs) in floating-point data. We show that SEUs produce predictable, non-uniform errors that can be bounded by analytically modeling perturbed dot products for elementary linear algebra constructs and by analyzing the convergence theory of first-order (stationary) iterative linear solvers. Convergence of stationary iterative methods remains provable in the presence of an SEU, and the performance impact (increased iteration count) of an SEU in data is predictable with low error.
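To make the scaling argument concrete, the following Python sketch (an illustration, not the project's actual model) combines Young's first-order approximation of checkpoint/restart waste with a standard birthday-problem estimate of the mean time to interrupt under dual redundancy; the node counts, per-node MTBF, and checkpoint cost are assumptions chosen only for this example.

    # A minimal sketch, assuming Young's first-order C/R model and a birthday-style
    # estimate for dual redundancy; all parameters below are illustrative, not measured.
    import math

    def waste_fraction(checkpoint_cost, mtti):
        """Time lost to checkpoints plus rework at Young's optimal interval (capped at 100%)."""
        tau = math.sqrt(2.0 * checkpoint_cost * mtti)   # Young's optimal checkpoint interval
        return min(1.0, checkpoint_cost / tau + tau / (2.0 * mtti))

    node_mtbf = 5.0 * 365 * 24    # assumed per-node MTBF: 5 years (in hours)
    ckpt_cost = 0.25              # assumed checkpoint cost: 15 minutes (in hours)

    for nodes in (10_000, 100_000):
        # Plain C/R: any single node failure interrupts the job.
        mtti_plain = node_mtbf / nodes
        # Dual redundancy on 2*nodes nodes: the job is interrupted only once BOTH replicas
        # of some rank are lost; the expected number of node failures survived before that
        # happens is roughly sqrt(pi * nodes / 2) (birthday-problem estimate).
        failures_survived = math.sqrt(math.pi * nodes / 2.0)
        mtti_redundant = (node_mtbf / (2.0 * nodes)) * failures_survived
        print(f"{nodes:>7} nodes: C/R waste ~{waste_fraction(ckpt_cost, mtti_plain):.0%}, "
              f"C/R + dual redundancy waste ~{waste_fraction(ckpt_cost, mtti_redundant):.0%} "
              f"(plus twice the resources)")

Under these assumed parameters, plain C/R loses roughly a third of the machine at 10,000 nodes and makes essentially no progress at 100,000 nodes, while the combination of C/R with dual redundancy keeps the time waste in the single digits at the cost of doubled resources, which is the trade-off the user-selectable failure-probability model exposes.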
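The idea behind SDC detection and correction via redundancy can be illustrated outside of MPI with a toy voting scheme: replicas of a rank produce the same message payload unless one of them was corrupted, so hashing and comparing the payloads detects divergence, and with three replicas a majority vote also corrects it. The sketch below is hypothetical and does not reproduce the project's two consistency protocols.

    # A toy sketch of SDC detection/correction by replica comparison (hypothetical,
    # not the project's MPI-level protocols). Two replicas can only detect a mismatch;
    # three replicas allow majority voting to also correct it.
    import hashlib
    from collections import Counter

    def payload_hash(payload: bytes) -> str:
        return hashlib.sha1(payload).hexdigest()

    def check_replicas(payloads):
        """Return (voted payload, indices of corrupted replicas) by majority vote."""
        votes = Counter(payload_hash(p) for p in payloads)
        winner, count = votes.most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: divergence detected but not correctable")
        bad = [i for i, p in enumerate(payloads) if payload_hash(p) != winner]
        good = next(p for p in payloads if payload_hash(p) == winner)
        return good, bad

    # Replica 1 suffers a bit flip in its send buffer; voting masks the corruption.
    clean = b"\x00\x01\x02\x03"
    flipped = b"\x00\x01\x42\x03"
    msg, corrupted = check_replicas([clean, flipped, clean])
    print(msg == clean, corrupted)   # True [1]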
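The SEU finding for stationary solvers can likewise be illustrated with a small, self-contained example: a Jacobi solve on an assumed diagonally dominant system still converges after a single injected bit flip in the iterate, and the penalty is only extra iterations whose number depends on which bit was hit. The matrix, injection point, and bit positions below are arbitrary choices for illustration, not the systems studied in the project.

    # A minimal sketch: inject one bit flip (SEU) into the iterate of a Jacobi solve
    # and observe that convergence still occurs, with an iteration-count penalty that
    # grows with the significance of the flipped bit.
    import struct

    def flip_bit(x: float, bit: int) -> float:
        """Flip one bit of an IEEE-754 double (bit 0 = least-significant mantissa bit)."""
        (raw,) = struct.unpack("<Q", struct.pack("<d", x))
        (y,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
        return y

    def jacobi_iterations(A, b, upset=None, tol=1e-10, max_iter=10_000):
        """Iteration count of a Jacobi solve; upset=(iteration, index, bit) injects one SEU."""
        n = len(b)
        x = [0.0] * n
        for it in range(1, max_iter + 1):
            if upset is not None and it == upset[0]:
                x[upset[1]] = flip_bit(x[upset[1]], upset[2])
            x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
            if max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n)) < tol:
                return it
        return max_iter

    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]   # diagonally dominant: Jacobi converges
    b = [1.0, 2.0, 3.0]
    print("clean solve:             ", jacobi_iterations(A, b))
    print("SEU in a low mantissa bit:", jacobi_iterations(A, b, upset=(10, 1, 20)))
    print("SEU in an exponent bit:   ", jacobi_iterations(A, b, upset=(10, 1, 55)))

A flip in a low mantissa bit is absorbed with essentially no penalty, while a flip in an exponent bit costs a handful of extra iterations yet still converges, mirroring the predictable, non-uniform error behavior described above.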

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1058779
Program Officer: Almadena Chtchelkanova
Budget Start: 2010-10-01
Budget End: 2016-09-30
Fiscal Year: 2010
Total Cost: $376,219
Name: North Carolina State University Raleigh
City: Raleigh
State: NC
Country: United States
Zip Code: 27695