While technology advances allow researchers to produce chips with higher performance and lower power consumption, our ability to deliver such computational power is challenged by the increasing susceptibility of silicon devices to faults. In future computer systems, faults are expected to occur continuously, across all levels from hardware to application. Fault behavior is also expected to become more diverse and unpredictable. Of critical concern will be not only permanent and transient faults, but also intermittent faults that occur frequently and irregularly over time scales ranging from nanoseconds to seconds.

These predicted high fault rates and diverse fault behaviors mandate a transformation in fault resilience approaches. When faults occur continuously, both fault detection and recovery must be performed at a much finer granularity, and recovery becomes as critical as detection. Moreover, since fault duration varies significantly, cost-effective solutions are needed that can uniformly detect all types of faults, identify the fault type, and then adaptively recover the execution.

To address these reliability challenges, the proposed project will incorporate fine-grained adaptivity into the system and couple statically extracted application information with runtime optimizations to guide adaptation decisions. The proposed research includes: (1) adaptive detection and checkpointing, capable of adjusting detection and checkpointing granularity to match system reliability levels; (2) adaptive recovery, capable of performing re-execution in a way that minimizes the chance of another fault occurring; and (3) adaptive resource management, capable of monitoring application and hardware reliability levels and quickly adapting scheduling decisions accordingly.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 1253733
Program Officer: Marilyn McClure
Project Start:
Project End:
Budget Start: 2013-06-01
Budget End: 2019-05-31
Support Year:
Fiscal Year: 2012
Total Cost: $481,541
Indirect Cost:
Name: University of Delaware
Department:
Type:
DUNS #:
City: Newark
State: DE
Country: United States
Zip Code: 19716