CSR---SMA+AES: Pro-Active Runtime Health Enhancement of Large-Scale Parallel Systems Using PROGNOSIS

Sivasubramaniam, Anand

Abstract

Large-scale parallel systems are essential for meeting the challenges posed by highly demanding, critically important applications. Pushing hardware and software technologies to their limits to extract maximum performance can increase the susceptibility of these systems to failures, a consequence of growing hardware transient errors, hardware device failures, and software complexity. These failures can substantially degrade system performance and add to maintenance and operation costs, putting at risk the very motivation for deploying such large-scale systems. Rather than treating failures as exceptions and applying reactive remedies, this project intends to anticipate their occurrence and take pro-active runtime measures to hide their impact.

This research is expected to make three broad contributions towards developing a runtime fault-tolerance infrastructure. The first is the collection and analysis of system events from an actual BlueGene/L system over an extended period of time. The second is a set of models for online analysis and prediction of evolving failure data. The third is failure-aware parallel job scheduling and checkpointing. On the educational front, in addition to enhancing graduate curriculum and research, this project intends to involve undergraduate students and women. The tools developed in this project and the related results will be made available in the public domain and published in leading journals and conferences. In addition, the PIs will push for these tools to be incorporated into actual systems to enhance their fault-tolerance abilities.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0615097
Program Officer: Krishna Kant

Project Start
Project End
Budget Start: 2006-08-15
Budget End: 2011-07-31
Support Year
Fiscal Year: 2006
Total Cost: $356,860
Indirect Cost

Institution: Pennsylvania State University, University Park, PA, United States
