Device physics, manufacturing, and engineering challenges in process scaling are making it significantly harder to produce reliable transistors for future technologies. Many academic experts, industry consortia, and research panels have warned that future generations of silicon technology are likely to be much less reliable, and that multi-core chips with cores failing in the field due to faults in silicon are around the corner. Concurrently with this decline in reliability, the energy efficiency of individual transistors is not keeping up with the increase in transistor density. These two trends portend a perfect storm: as improvements in transistor energy efficiency slow down, transistors are also becoming highly unpredictable, which will force further inefficiencies. Addressing hardware reliability is a fundamental problem for microprocessors and hence for sustaining the IT revolution. This project looks at mechanisms for allowing chips and the higher levels of software to continue working even when devices fail. The basic question the project addresses is how to detect when chips fail.
The core idea that this project builds upon is the principle of sampling. Instead of checking for failures all the time, the idea is to check for device failures only during a periodic sampling window. The project investigates formal models, hardware implementation, and evaluation to understand the effect of device failures and the impact of the detection techniques.
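
The trade-off behind sampling can be illustrated with a small sketch. It is not the project's actual mechanism: it assumes a permanent fault that is exposed on any cycle in which checking is active, and that checking is enabled for the first `window` cycles of every `period` cycles. All function and parameter names here are hypothetical.

```python
def is_checked(cycle: int, period: int, window: int) -> bool:
    """True if fault checking is active during this cycle (assumed schedule)."""
    return cycle % period < window


def detection_cycle(fault_cycle: int, period: int, window: int) -> int:
    """First cycle at or after fault_cycle in which checking is active.

    Assumes a permanent fault that is caught on any checked cycle.
    """
    if not 0 < window <= period:
        raise ValueError("window must satisfy 0 < window <= period")
    offset = fault_cycle % period
    if offset < window:
        # The fault appears inside an active sampling window.
        return fault_cycle
    # Otherwise wait for the start of the next sampling window.
    return fault_cycle - offset + period


def checking_overhead(period: int, window: int) -> float:
    """Fraction of cycles spent with checking enabled."""
    return window / period


if __name__ == "__main__":
    period, window = 100, 10
    print(checking_overhead(period, window))                # 0.1
    print(detection_cycle(5, period, window))               # 5
    print(detection_cycle(50, period, window) - 50)         # latency of 50 cycles
```

Under these assumptions, checking is active for only `window / period` of the time, while the worst-case detection latency is `period - window` cycles, which is the cost sampling pays for the reduced checking overhead.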