Virtually all fields of science and engineering depend on fundamental advances in computing. High-End Computing (HEC) simulations in various areas of science enable us to understand the world around us. Unfortunately, HEC is known for its lack of sustained performance and reliability: the system-wide failure rate increases significantly as the number of components grows. The conventional method for fault tolerance in HEC, checkpointing, is costly and triggers a cycle of deterioration that is fueled by ever-increasing HEC complexity. A new fault-tolerant approach is a must for next-generation HEC. In this research, the PIs propose a novel Hybrid Fault Tolerant (HFT) approach for HEC that combines long-term and short-term techniques to improve fault management. Long-term prediction models the likelihood of faults based on historical data and thereby facilitates failure-aware scheduling, which intelligently maps jobs to available resources. Short-term prediction diagnoses the root causes of unusual runtime events and triggers job rescheduling on the fly to move running jobs away from the troublesome resources. The long-term and short-term support complement each other: failure-aware scheduling protects inactive jobs (i.e., jobs that are not yet scheduled) from failures that are well captured by the long-term failure models, while failure-aware rescheduling enables active jobs (i.e., jobs that are already scheduled and running) to avoid irregular failures that may not follow any long-term pattern but can be discovered at runtime (e.g., sudden hardware and software errors). The integrated long-term and short-term approach promotes a better understanding of failure trends and modes and consequently improves system productivity in HEC.
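
The interplay between the two mechanisms can be illustrated with a minimal sketch. It is not the PIs' implementation: the `Node` class, the `failure_prob` field, and the "pick the node with the lowest predicted failure probability" policy are all assumptions made for illustration, standing in for whatever long-term model and runtime diagnosis the project actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute resource (not from the abstract)."""
    name: str
    failure_prob: float  # assumed output of a long-term model built from historical data
    jobs: list = field(default_factory=list)


def schedule(job, nodes):
    """Failure-aware scheduling: place a not-yet-scheduled job on the
    node with the lowest predicted failure probability."""
    target = min(nodes, key=lambda n: n.failure_prob)
    target.jobs.append(job)
    return target


def reschedule(suspect, nodes):
    """Failure-aware rescheduling: move running jobs off a node that a
    (hypothetical) short-term diagnosis has flagged at runtime.
    Assumes at least one other node is available."""
    healthy = [n for n in nodes if n is not suspect]
    moved = {}
    for job in list(suspect.jobs):
        target = min(healthy, key=lambda n: n.failure_prob)
        suspect.jobs.remove(job)
        target.jobs.append(job)
        moved[job] = target.name
    return moved


# Illustrative values only; these probabilities are made up.
nodes = [Node("n0", 0.20), Node("n1", 0.02), Node("n2", 0.05)]
placed = schedule("jobA", nodes)   # long-term view selects n1
moved = reschedule(placed, nodes)  # n1 is flagged at runtime; jobA moves to n2
```

In this sketch the long-term prediction only influences the initial placement, while the runtime flag overrides it for jobs that are already running, which mirrors the division of labor between inactive and active jobs described in the abstract.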

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
0702737
Program Officer
Almadena Y. Chtchelkanova
Project Start
Project End
Budget Start
2007-09-15
Budget End
2010-08-31
Support Year
Fiscal Year
2007
Total Cost
$200,000
Indirect Cost
Name
Illinois Institute of Technology
Department
Type
DUNS #
City
Chicago
State
IL
Country
United States
Zip Code
60616