Microprocessor performance has been increasing exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. Although these transistors improve performance, their lower threshold voltages and tighter noise margins make them less reliable, rendering processors that use them more susceptible to transient faults. Many fault-tolerance techniques have been proposed for high-end systems, but their high hardware costs make them impractical for the desktop and embedded computing markets.
This work develops the concept of software-modulated fault tolerance (SMFT) to reduce the cost of reliability by taking advantage of naturally occurring non-uniformity in programs. Letting the system, the programmer, or even the user decide when and how to apply protection allows the impact of fault tolerance to be adapted to the constantly varying needs of the system. By increasing reliability only when warranted, SMFT frees up resources that can be used either to increase performance or to reduce power. Through a set of profiler, compiler, and language techniques, this work allows designers to continue scaling processor performance for all markets despite the presence of transient faults.