Graphics Processing Units (GPUs) are becoming the default choice for general-purpose hardware acceleration because of their ability to enable orders of magnitude faster and energy-efficient execution for large-scale high-performance computing applications. Since the majority of such applications executing on large-scale HPC systems are long-running, it is very important that they cope with a variety of hardware- and software-based faults. Many prior works have shown that real HPC systems are vulnerable to soft errors. An absence of essential protection and checkpointing mechanisms can lead to lower scientific productivity, operational efficiency, and even monetary loss. However, these protection mechanisms (e.g., error correction codes) are themselves not free -- they incur very high performance, energy, and area costs. 

This project takes a holistic approach to explore the avenues to reduce these protection overheads by taking advantage of the fact that all errors do not lead to an unacceptable loss in the accuracy of application output. Prior results show that GPGPU applications are amenable to such accuracy-aware optimizations. In order to enable these optimizations, this project will address three major research questions: a) What hardware/software support and tools are necessary to determine which instructions are not vulnerable to soft errors, b) Based on this analysis, which hardware component(s) need not be protected and for how long, while not sacrificing application quality beyond the user's quality requirements, and c) What optimizations in terms of resource management and scheduling are necessary to make low-overhead but reliable computation more effective and efficient. These questions will be explored via a variety of GPGPU applications emerging from the areas of high-performance computing (HPC), big-data analytics, machine learning, and graphics. If successful, this project will generate several novel research insights that will play an important role in enabling low-cost reliable GPU computing. The results of this project will be integrated into the existing and new undergraduate and graduate courses on computer architecture and reliability, which will facilitate in training students, including women and students from diverse backgrounds and minority groups. 

Project Start
Project End
Budget Start
2017-08-01
Budget End
2021-07-31
Support Year
Fiscal Year
2017
Total Cost
$449,999
Indirect Cost
Name
College of William and Mary
Department
Type
DUNS #
City
Williamsburg
State
VA
Country
United States
Zip Code
23187