CPA: A Hybrid Fault Tolerant Approach for High-End Computing

Sun, Xian-He; Lan, Zhiling

Abstract

Virtually all fields of science and engineering depend on fundamental advances in computing. High-End Computing (HEC) simulations in various areas of science enable us to understand the world around us. Unfortunately, HEC is known for its lack of sustained performance and reliability. Its system-wide failure rate increases significantly with the growing number of components. The conventional method for fault tolerance in HEC, checkpointing, is costly and triggers a cycle of deterioration. This deterioration is fueled by ever-increasing HEC complexity. A new fault-tolerant approach is a must for next-generation HEC. In this research, the PIs propose a novel Hybrid Fault Tolerant (HFT) approach for HEC that combines long-term and short-term techniques to improve fault management. Long-term prediction models the likelihood of faults based on historical data, and consequently facilitates failure-aware scheduling by intelligently mapping jobs to available resources. Short-term prediction diagnoses the root causes of unusual runtime events, and triggers on-the-fly job rescheduling to move running jobs away from the troublesome resources. The long-term and short-term support complement each other: failure-aware scheduling protects inactive jobs (i.e., jobs that are not yet scheduled) from failures that are well captured in the long-term failure models, while failure-aware rescheduling enables active jobs (i.e., jobs that are already scheduled and running) to avoid irregular failures that may not follow any long-term pattern but can be discovered at runtime (e.g., sudden hardware and software errors). The integrated long-term and short-term approach promotes a better understanding of failure trends and modes and consequently improves system productivity in HEC.
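The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not the PIs' implementation: the exponential failure model, the error-count threshold, and the names `failure_probability`, `pick_nodes`, and `should_reschedule` are all assumptions made here for illustration only.

```python
import math


def failure_probability(failures_per_hour, job_hours):
    """Chance of at least one failure during the job.

    Assumption: failures follow an exponential model with a rate
    estimated from historical logs (the abstract names no model).
    """
    return 1.0 - math.exp(-failures_per_hour * job_hours)


def pick_nodes(node_rates, nodes_needed, job_hours):
    """Long-term side: map a not-yet-scheduled job to the nodes
    with the lowest predicted failure probability.

    node_rates is a hypothetical dict of node name -> failures per hour.
    """
    if nodes_needed > len(node_rates):
        raise ValueError("not enough nodes for this job")
    ranked = sorted(
        node_rates,
        key=lambda node: failure_probability(node_rates[node], job_hours),
    )
    return ranked[:nodes_needed]


def should_reschedule(recent_error_count, threshold=3):
    """Short-term side: flag a running job for migration when its node
    reports an unusual number of runtime errors.

    Assumption: a fixed count threshold stands in for real diagnosis.
    """
    return recent_error_count >= threshold


if __name__ == "__main__":
    rates = {"n1": 0.01, "n2": 0.001, "n3": 0.05}
    print(pick_nodes(rates, nodes_needed=2, job_hours=10))
    print(should_reschedule(recent_error_count=5))
```

In this sketch the long-term model only affects placement of waiting jobs, while the runtime check only affects jobs that are already running, which mirrors the division of labor the abstract describes.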

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 0702737
Program Officer: Almadena Y. Chtchelkanova

Budget Start: 2007-09-15
Budget End: 2010-08-31
Fiscal Year: 2007
Total Cost: $200,000

Institution

Illinois Institute of Technology, Chicago, IL, United States
