CSR/AES: Enhancing Application Robustness via Adaptive and Cooperative Methods

Lan, Zhiling

Abstract

As the scale of high performance computing continues to grow, application robustness becomes increasingly important. Checkpointing is the conventional method for fault tolerance. However, it deals with failures only after they occur, through rollback. When a single process fails, all processes, including non-faulty ones, must be restarted from the state saved before the failure. The lost work and the recovery itself can therefore incur significant performance loss. Proactive approaches take preventive actions (e.g., preemptive process migration) before failures occur, thereby avoiding failures at low cost. Nevertheless, their effectiveness relies on perfect fault prediction, which is hardly achievable in practice.

This project investigates a new approach called adaptive fault management, which intelligently integrates proactive and reactive robustness techniques. It will enable applications to avoid anticipated faults where possible and, in the case of unforeseeable faults, to tolerate them in such a way that their impact is kept to a minimum. The project consists of three major components: (1) cooperative anomaly diagnosis (CAD), to improve fault prediction in large-scale systems by developing meta-learning methods; (2) an adaptive control manager (ACM), to allow runtime decision making in response to imperfect fault prediction; and (3) integrated runtime support (IRS), to enable cost-effective coordination of fault handling techniques at runtime. The resulting framework will enhance the robustness of high performance computing applications by improving their performance in the presence of failures. This project also enhances the systems-area curriculum at Illinois Institute of Technology and helps train the future-generation scientific computing workforce.
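To make the idea of runtime decision making under imperfect fault prediction more concrete, the following is a minimal illustrative sketch, not the project's actual ACM design. It assumes a simple expected-cost model in which a failure warning is correct with probability `precision`, migration always avoids the predicted failure, and all costs are given in seconds. The names `Costs`, `expected_costs`, and `choose_action` are hypothetical and do not come from the award abstract.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-action costs, all in seconds."""
    migrate: float       # time to migrate processes off the suspect node
    checkpoint: float    # time to write a checkpoint now
    restart: float       # time to restart all processes from a checkpoint
    unsaved_work: float  # work done since the last checkpoint


def expected_costs(precision: float, c: Costs) -> dict:
    """Expected cost of each action when a failure warning is raised.

    `precision` is the assumed probability that the warned failure
    really occurs. Simplifying assumption: migration always avoids it.
    """
    if not 0.0 <= precision <= 1.0:
        raise ValueError("precision must be in [0, 1]")
    return {
        # Proactive: pay the migration cost, failure is avoided.
        "migrate": c.migrate,
        # Reactive, prepared: checkpoint now, restart only if it fails.
        "checkpoint": c.checkpoint + precision * c.restart,
        # Ignore the warning: risk losing the unsaved work as well.
        "continue": precision * (c.restart + c.unsaved_work),
    }


def choose_action(precision: float, c: Costs) -> str:
    """Pick the action with the lowest expected cost."""
    costs = expected_costs(precision, c)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    c = Costs(migrate=60, checkpoint=30, restart=120, unsaved_work=600)
    for p in (0.9, 0.01):
        print(p, choose_action(p, c))
```

Under this toy model, an accurate predictor favors proactive migration, while a very unreliable one makes it cheaper to ignore the warning. The middle ground, where taking a checkpoint is the best response, appears when migration is expensive relative to checkpointing.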

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 0720549
Program Officer: Krishna Kant

Project Start
Project End
Budget Start: 2007-08-01
Budget End: 2011-07-31
Support Year
Fiscal Year: 2007
Total Cost: $212,000
Indirect Cost

Illinois Institute of Technology, Chicago, IL, United States
