The objective of this project is to systematically design means of obtaining, standardizing, and manipulating quantified Reliability, Availability and Serviceability (RAS) information from extreme-scale High Performance Computing (HPC) distributions, and to develop a novel, scalable framework for real-time RAS monitoring and modeling of these systems by researching and creating an optimal feedback control loop that encompasses the entire computational environment. This work is necessitated by the continual and substantial increase in the size and scope of HPC systems, which is rapidly inflating the number of faults, errors, and other performance interruptions these machines encounter.

As HPC systems move towards the petaflop era, a greater focus must be placed on the performance interruptions encountered by these machines and on developing means by which they may continue computation uninterrupted. In this extreme-scale environment, efforts aimed at maintaining high reliability and uptime are futile: with their enormous processor and computational unit counts, these systems will inevitably encounter performance issues, and failure must be expected. This project aims to 1) research and develop advanced, standardized methodologies for gathering application- and system-level data and generating quantifiable RAS metrics, 2) provide a novel, scalable solution for more accurately predicting imminent node-wise and system failures in large-scale systems, and 3) devise defensive and proactive techniques for reducing the computational cost of handling resilience issues and modeling system health in a timely and accurate manner. In summary, this work attempts to alleviate the time and cost limitations of contemporary, reactive fault tolerance schemes, and will advance the development of scalable, proactive, and intelligent resilience provision in large-scale computing deployments. In addition, the Resilience Consortium will be established to conduct collaborative research and development, share data and findings, and disseminate knowledge to the public.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0834483
Program Officer: Krishna Kant
Project Start:
Project End:
Budget Start: 2008-09-01
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2008
Total Cost: $300,000
Indirect Cost:
Name: Louisiana Tech University
Department:
Type:
DUNS #:
City: Ruston
State: LA
Country: United States
Zip Code: 71272