This project explores hardware and software techniques, as well as programming model considerations, for enabling flexible and dynamic soft error protection. Current hardware error-tolerance mechanisms take an all-or-nothing approach: the user either prepays for worst-case error protection or relies entirely on software to tolerate errors. For many applications and usage scenarios, neither option is desirable: each wastes resources and squanders performance. The goal is to enable cooperative protection schemes and maximize efficiency by blurring the line between hardware and software control of soft error protection. The PI will investigate techniques that allow the hardware designer, software system, and programmer to make optimal decisions for their usage model. The aim is to make error tolerance a first-class optimization option so that users can "pay for the error tolerance they need, rather than overpay for what they might need". Applicable scenarios will be studied and analyzed to gauge the advantages for overall system design, effective memory capacity, and performance and power. The PI will also develop methods that let the programmer express, explicitly yet abstractly and intuitively, the trade-offs between protection and hardware resources. The approach is based on a new mechanism called "limited guaranteed precision", which allows a programmer to selectively protect costly or sensitive portions of a computation.

To achieve long-term impact, the project will develop materials that train developers to realize the benefits of treating reliability as a first-class application property. The education plan revolves around course modules, problems, and demonstrations at levels ranging from popular talks and mini-talks suitable for middle-school and high-school audiences, through lower- and upper-division undergraduate course modules, to in-depth graduate-level study. The project will also introduce students who are not computer scientists or engineers to scientific computing and systems and train them in these areas, thereby increasing US high-end computing competitiveness. The outcome of this research can impact related fields, industry, and society at large by maintaining advances in computational tools for science and engineering.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 0954107
Program Officer: Hong Jiang
Project Start:
Project End:
Budget Start: 2010-03-15
Budget End: 2015-02-28
Support Year:
Fiscal Year: 2009
Total Cost: $360,306
Indirect Cost:
Name: University of Texas Austin
Department:
Type:
DUNS #:
City: Austin
State: TX
Country: United States
Zip Code: 78712