Large-scale parallel systems are essential for meeting the challenges posed by highly demanding, critically important applications. Pushing hardware and software technologies to their limits to extract maximum performance can increase these systems' susceptibility to failures. This susceptibility arises from increasing transient hardware errors, hardware device failures, and software complexity. Such failures can substantially degrade system performance and add to maintenance and operation costs, thereby putting at risk the very motivation for deploying large-scale systems. Rather than treating failures as exceptions and applying reactive remedies, this project intends to anticipate their occurrence and take proactive runtime measures to hide their impact.
This research is expected to make three broad contributions towards developing a runtime fault-tolerance infrastructure. The first set of contributions concerns collecting and analyzing system events from an actual BlueGene/L system over an extended period of time. The second set consists of models for online analysis and prediction of evolving failure data. The third set addresses failure-aware parallel job scheduling and checkpointing. On the educational front, in addition to enhancing graduate curriculum and research, this project intends to involve undergraduate students and women. The tools developed in this project and the related results will be made available in the public domain and published in leading journals and conferences. In addition, the PIs will push for these tools to be incorporated into actual systems to enhance their fault-tolerance abilities.