The goal of this project is to develop flexible and efficient Runtime and Compiler System (RCS) technologies to cost-effectively detect and recover from hardware faults in upcoming multicore chips. Semiconductor variations, temperature hot spots, soft errors and aging will make hardware reliability one of the central concerns in the design of multicore processors. RCS technologies will make it possible to meet this challenge because of their flexibility, low cost and ability to target errors that affect program outcome.
Two important objectives of this project are: (1) to avoid full instruction replication within or across threads, which is key to acceptance in the energy- and cost-conscious commodity markets, and (2) to provide knobs to select the desired performance vs. error-coverage tradeoff.
A prototype, SoftCheck, will be implemented for evaluation purposes. A wide range of novel, cost-effective fault-detection and fault-correction techniques will be designed and implemented in SoftCheck. The fault-detection techniques will include: (i) exhaustive self-checking, (ii) partial self-checking, (iii) partial cross-thread checking in a multicore environment, and (iv) other cross-cutting, often multiprocessor-related, approaches. The fault-correction techniques will include: (i) disabling clusters in a core, (ii) disabling complete cores, and (iii) dynamic recompilation to use other hardware.