The drive for increased performance and functionality has pushed computer chips to their physical limits of power, energy, and reliability. Future computing systems are likely to suffer from high fault rates, undermining their programmability and usability. This research project will leverage existing techniques and invent new ones to detect, isolate, and recover from faults and to ensure overall system resilience with minimal impact on performance. The project taps into the rich body of past work on formal methods, which have been successful in finding logical errors in systems. Newly developed formal methods will focus on resilience enhancement. The project will also explore the inherent trade-offs among performance, power, and resilience. The project will develop an extensible platform for empirical evaluation of resilience methods. This platform will consist of programmable chips and accompanying software components.
This project will foster the development of new system design methods that take reliability into account. It will help fill a serious void in readily usable infrastructure for resilience studies of parallel systems by developing and releasing tools for evaluating the methods. The project will also develop and release rigorously specified resilience-aware system interfaces. The project emphasizes student training, including student recruitment and the introduction of new classes that are integrated with resilience research.