In High-End Computing (HEC), faults have become the norm rather than the exception for parallel computation on clusters with tens to hundreds of thousands of cores. As the core count increases, so does the overhead of fault-tolerant techniques relying on checkpoint/restart (C/R) mechanisms. At 50% overhead, redundancy is a viable alternative to fault recovery and actually scales, which makes the approach attractive for HEC.
The objective of this work is to develop a synergistic approach that combines C/R-based fault tolerance with redundancy in HEC installations to achieve high levels of resilience.
This work alleviates the scalability limitations of current fault-tolerant practices. It contributes to fault modeling as well as fault detection and recovery, significantly advancing existing techniques by controlling levels of redundancy and checkpointing intervals in the presence of faults. It is transformative in providing a model where users select a target failure probability at the price of using additional resources.
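The checkpoint-interval control mentioned above is commonly reasoned about with a first-order model such as Young's approximation, which balances checkpoint cost against expected rework after a failure. The sketch below is illustrative only and is not necessarily the model used in this work; the function name and parameters are our own.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """First-order optimal checkpoint interval (Young's approximation):
    tau ~ sqrt(2 * C * MTBF), where C is the time to write one checkpoint
    and MTBF is the system's mean time between failures (same units)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Example: a 60 s checkpoint on a system with a 24 h MTBF
# suggests checkpointing roughly every ~54 minutes.
tau = young_interval(60.0, 86400.0)
```

As the MTBF shrinks with growing core counts, the optimal interval shrinks with its square root, so checkpoint overhead grows; this is the scaling pressure that motivates combining C/R with redundancy.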
Our work shows that redundancy-based fault tolerance can be used in synergy with checkpoint/restart-based fault tolerance to achieve better application performance for large-scale HPC applications than either technique achieves alone, a result we have both analytically modeled and experimentally confirmed. We further assessed the feasibility and effectiveness of silent data corruption (SDC) detection and correction at the MPI layer via redundancy. We developed two consistency protocols, explored the unique challenges in creating a deterministic MPI environment for replication purposes, investigated the effects of fault injection into our framework, analyzed the costs, and demonstrated the benefits of SDC protection via redundancy. We also studied Single Event Upsets (SEUs) in floating-point data.
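The core idea behind SDC detection and correction via redundancy is simple to state: with two replicas a disagreement can be detected but not resolved, while with three or more a majority can outvote a single corrupted copy. The following is a minimal, self-contained sketch of that voting logic, not the actual MPI-layer consistency protocols developed in this work.

```python
from collections import Counter

def vote(replica_values):
    """Compare redundant replica results for one message/value.
    Returns (value, status):
      - ("ok")        all replicas agree
      - ("corrected") a strict majority agrees (>=3 replicas needed)
      - ("detected")  replicas disagree with no majority (e.g. 2 replicas)
    """
    counts = Counter(replica_values)
    value, n = counts.most_common(1)[0]
    if n == len(replica_values):
        return value, "ok"
    if n > len(replica_values) // 2:
        return value, "corrected"   # majority outvotes the corrupted replica
    return None, "detected"        # mismatch detectable, not correctable

# Dual redundancy detects; triple redundancy corrects.
print(vote((3.14, 2.71)))        # detection only
print(vote((3.14, 3.14, 2.71)))  # correction by majority
```

In an actual replicated MPI environment, this comparison would apply to message payloads, which is why deterministic execution across replicas (addressed by our consistency protocols) is a prerequisite.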
We show that SEUs produce predictable, non-uniform errors that can be bounded using analytical modeling of perturbed dot products for elementary linear algebra constructs, and by analyzing the convergence theory of first-order (stationary) iterative linear solvers. Convergence for stationary iterative methods is provable, and the performance impact (an increased iteration count) of an SEU in data is predictable with low error.
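The behavior described above can be illustrated with a toy experiment: flip one bit of a float64 in the iterate of a stationary solver (Jacobi) mid-run and observe that the method still converges to the correct answer, at the cost of at most a few extra iterations. This is our own illustrative sketch, with hypothetical helper names, not the analytical model or experiments from this work.

```python
import struct

def flip_bit(x, k):
    """Simulated SEU: flip bit k (0 = mantissa LSB) of a float64."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << k)))
    return flipped

def jacobi(A, b, inject_at=None, bit=40, tol=1e-10, max_iter=1000):
    """Jacobi iteration on a diagonally dominant system; optionally
    injects a single bit flip into the iterate at iteration inject_at."""
    n = len(b)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i))
                 / A[i][i] for i in range(n)]
        if it == inject_at:
            x_new[0] = flip_bit(x_new[0], bit)  # SEU in solver data
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, it
        x = x_new
    return x, max_iter

A = [[4.0, 1.0, 1.0], [1.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
b = [6.0, 6.0, 6.0]  # exact solution: x = [1, 1, 1]
x_clean, it_clean = jacobi(A, b)
x_seu, it_seu = jacobi(A, b, inject_at=10)
```

Because the iteration matrix contracts the error at every step, the injected perturbation is damped like any other error component, which is why the impact is visible only as a bounded increase in iteration count rather than a wrong answer.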