Advances in computing power over the past two decades have driven successive generations of increasingly powerful supercomputers. Petascale systems containing tens of thousands of processors have recently emerged. At this scale, frequent component and software faults cause parallel applications to fail often, forcing users to save critical program data (checkpoint) at an unsustainable scale and pace, which wastes resources and triggers additional faults. This has created a crisis: recent experience with petascale systems reveals that increased checkpoint frequency induces additional faults, and the overhead required for checkpointing on peta- and exascale systems is approaching theoretical scaling limits. Together, these problems represent a petascale reliability barrier that prevents the effective use of these systems.

To address this problem, the investigator is pursuing a research and education program to improve the reliability and efficiency of high-performance computing (HPC) systems through a comprehensive approach to fault detection, prediction, response, and recovery. The effort involves work on four fronts: i) investigation of new methods for fault detection and prediction; ii) creation of new algorithms, techniques, and tools to avoid faults by responding proactively to potential faults, and to recover efficiently from faults when they occur; iii) creation of a fault injection framework and architecture testbed to assess and validate fault prediction, detection, and proactive and reactive response mechanisms; and iv) development of an education and training program to disseminate fault-aware practices to systems administrators and application developers. The expected results are more reliable HPC systems and parallel applications; new fault prediction, detection, and response algorithms, software libraries, and tools; and a cohort of students and researchers trained to use fault prediction and response technologies.

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Application #: 0952960
Program Officer: Almadena Y. Chtchelkanova
Project Start:
Project End:
Budget Start: 2010-03-01
Budget End: 2015-02-28
Support Year:
Fiscal Year: 2009
Total Cost: $312,499
Indirect Cost:
Name: Purdue University
Department:
Type:
DUNS #:
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907