Microprocessor performance has been increasing exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. Although such transistors improve performance, their lower threshold voltages and tighter noise margins make them less reliable, leaving the processors built from them more susceptible to transient faults. Many fault-tolerance techniques have been proposed for high-end systems, but their high hardware costs make them impractical for the desktop and embedded computing markets.

This work develops the concept of software-modulated fault tolerance (SMFT) to reduce the cost of reliability by taking advantage of naturally occurring non-uniformity in programs. By letting the system, the programmer, or even the user decide when and how to apply protection, SMFT adapts the cost of fault tolerance to the constantly varying needs of the system. By increasing reliability only when warranted, SMFT frees up resources that can be used either to increase performance or to reduce power. By developing a set of profiler, compiler, and language techniques, this work allows designers to continue scaling processor performance for all markets despite the presence of transient faults.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0615250
Program Officer: Mohamed G. Gouda
Budget Start: 2006-07-01
Budget End: 2009-06-30
Fiscal Year: 2006
Total Cost: $320,000
Name: Princeton University
City: Princeton
State: NJ
Country: United States
Zip Code: 08540