In recent decades, microprocessor performance has increased exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. While such transistors yield performance enhancements, their lower threshold voltages and tighter noise margins make them less reliable, rendering processors that use them more susceptible to transient faults caused by energetic particles striking the chip. Such faults can corrupt computations, crash computers, and cause significant economic damage. Indeed, Sun Microsystems, Cypress Semiconductor, and Hewlett-Packard have all recently acknowledged major failures at client sites due to transient faults.
This project addresses several basic scientific questions: How does one build software systems that operate on faulty hardware yet provide ironclad reliability guarantees? For what fault models can these guarantees be provided? Can one prove that a given implementation does indeed tolerate all faults described by the model? Driven in part by the answers to these scientific questions, this project will produce a trustworthy, flexible, and efficient computing platform that tolerates transient faults. The multidisciplinary project team will do this by developing: (1) programming language-level reliability specifications so consumers can dictate the level of reliability they need, (2) reliability-preserving compilation and optimization techniques that improve the performance of reliable code while ensuring correctness, (3) automatic, machine-level verifiers so compiler-generated code can be proven reliable, (4) new software-modulated fault tolerance techniques at the hardware/software boundary to implement the reliability specifications, and finally (5) microarchitectural optimizations that explore trade-offs between reliability, performance, power, and cost.