Computing infrastructure has been a driving force for our socio-economic progress in the past several decades. From drug discovery to space exploration, every scientific and engineering domain relies on computer systems to accurately analyze complex datasets. Historically, computational accuracy has been taken for granted in all these disciplines, but this notion is changing. While rapidly shrinking transistor dimensions lead to exponential power and performance benefits, the trend is also creating several unwanted side effects in computer system reliability. There are two types of errors that will become prevalent in the near future: (1) multi-bit soft errors where alpha particles and neutrons cause multiple bits to flip at the same time, and (2) intermittent errors that occur due to stress accumulation over the lifetime of a computer. Thus it is critical to benchmark the impact of these errors on the lifetime of a computer chip. Only when the impact is accurately measured is it possible to judiciously deploy solutions to improve reliability. Since any protection scheme comes with a cost, it is necessary to understand when a particular protection scheme being considered, such as parity or single-error-correcting double-error-detecting code, is too much or too little.

This project presents two solutions for benchmarking multi-bit soft errors and intermittent errors. This project will develop a unified methodology to benchmark the impacts of single-bit and multi-bit soft errors on caches protected with an arbitrary protection scheme, such as an inter-leaved, block-level or word-level error correcting code. Such a benchmarking framework will significantly enhance a computer designer's ability to objectively evaluate the performance, power, and reliability tradeoffs of various protection schemes proposed for protecting caches.

This research also develops a methodology to benchmark the vulnerability of an instruction set architecture (ISA) to intermittent errors. Each instruction in an ISA specification is enhanced to quantify the amount of stress that it is expected to cause on the underlying microarchitecture of a chip. The stress level information from the ISA is combined with operating conditions of the chip to continuously monitor intermittent error probability during application execution. Any unwanted degradation in chip reliability is then tackled by software exception handlers, which trigger redundant execution of vulnerable code.

Broader societal impact will result from these research solutions. Benchmarking is essential to objectively evaluate the cost-benefit tradeoffs of various solutions currently being proposed to tackle reliability concerns. Without benchmarking, building a system to meet reliability specifications is a guessing game. By providing the right set of tools to initiate just-in-time error correction and recovery mechanisms, a computer designer can significantly lower the cost of providing reliable computations.

Project Start
Project End
Budget Start
2012-08-01
Budget End
2016-07-31
Support Year
Fiscal Year
2012
Total Cost
$399,999
Indirect Cost
Name
University of Southern California
Department
Type
DUNS #
City
Los Angeles
State
CA
Country
United States
Zip Code
90089