The silicon technology underlying the growth in computer performance and functionality over the last several decades is now reaching fundamental physical limits. As this happens, computer hardware is becoming increasingly susceptible to errors. Traditional reliability solutions avoid such errors through indiscriminate redundancy, which is too expensive for emerging systems. A promising approach is to rely on software to provide acceptable resiliency to hardware errors at a much lower cost by using selective redundancy only where needed. A key obstacle to practical adoption of software-driven solutions is that some hardware errors may escape the software stack, leading to unacceptable data corruption. It is therefore critical to develop analysis techniques that can identify software regions that are potentially vulnerable to hardware errors, and low-cost mitigation or hardening techniques that can make such software regions resilient to data corruption.
This project will develop a principled and scalable approach to resiliency analysis and hardening for software. The project is based on two observations. First, resiliency analysis is analogous to software testing, which seeks to find software bugs. Second, resiliency hardening is analogous to software debugging and repair. The work will leverage methods previously used for software testing and debugging to improve resiliency analysis and hardening for diverse computer architectures. It will (1) explore new testing-based techniques to improve the quality and diversity of test inputs used for resiliency analysis; (2) leverage program-analysis and machine-learning methods to make resiliency analysis faster and more accurate for diverse computer architectures; (3) develop formal specifications, optimization strategies, and machine-learning-based methods to harden software using low-cost checkers; and (4) develop techniques to apply resiliency solutions in an incremental and compositional way. The goal is to make the promise of low-cost software-driven approaches to hardware reliability practical by incorporating resiliency analysis and hardening within a modern software-development workflow. The project offers the opportunity for multidisciplinary training of students in the fields of computer architecture, software testing, program analysis, and machine learning, and will broaden participation in computing through increased recruitment and retention efforts for women and underrepresented minorities.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.