Hardware reliability is becoming an increasing concern in the late CMOS era. Components in shipped chips will fail for many reasons, requiring mechanisms to detect, diagnose, recover from, and repair/reconfigure around these failed components so that the system can provide reliable operation. The pervasiveness of the problem across a broad market demands low-cost and general reliability solutions that can be deployed in general-purpose, commodity systems running applications with varying reliability requirements. Traditional reliability solutions involving excessive redundancy are too expensive, as are piecemeal solutions that address individual failure modes. This work proposes a full system solution that aims to provide a common framework for error detection, diagnosis, recovery, and repair/reconfiguration for a variety of hardware failure modes, with a customizable reliability vs. overhead tradeoff.

Two key high-level observations motivate the approach. First, the hardware reliability solutions need handle only the device faults that propagate through higher levels of the system and become observable to software. Second, in spite of the reliability threat, fault-free operation remains the common case and must be optimized, possibly at the cost of increased overhead once a fault is detected. The proposed system therefore detects faults by watching for anomalous software behavior (or symptoms of faults), using novel zero to low-cost hardware and software monitors. After a fault is detected, it invokes an innovative, but potentially expensive, procedure for diagnosing the fault source to enable reconfiguration/repair (in the case of hard faults). For recovery, it relies on a checkpoint/replay mechanism, including pure hardware and hybrid software assisted recovery depending on detection latency. Coordinating all of the above is a thin firmware layer that provides flexibility and customizability. A major component of the work is a much needed formulation and validation of microarchitecture level fault models, required to drive high-level reliability solutions.

Project Start
Project End
Budget Start
2008-09-01
Budget End
2013-08-31
Support Year
Fiscal Year
2008
Total Cost
$500,000
Indirect Cost
Name
University of Illinois Urbana-Champaign
Department
Type
DUNS #
City
Champaign
State
IL
Country
United States
Zip Code
61820