Fault-tolerance is now a primary design constraint for all major microprocessors; however, perfect fault-tolerance is not a requirement for most designs. Instead, designs strive to maximize performance subject to an acceptable failure rate constraint. Therefore, vendors typically set a failure rate target, expressed in FIT (failures in time), for each design and validate that the design meets this target with extensive pre-silicon and post-silicon analysis. Because many faults are masked before they affect program output, this validation depends on quantifying fault masking. One method to quantify fault masking is to use vulnerability factors. A system consists of multiple independent components that interact through well-defined interfaces. Therefore, fault masking can be quantified within a single component by focusing on its interfaces. This abstraction is called the "vulnerability stack", and is the major focus of this project.
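
As a rough illustration of how a vulnerability factor relates a raw fault rate to a FIT target, the sketch below assumes the common first-order model in which a component's effective failure rate is its raw fault rate multiplied by the fraction of faults that are not masked, and in which component contributions add. The component names, fault rates, vulnerability factors, and target are hypothetical values chosen for illustration; they are not results from this project, and the vulnerability stack itself combines components in a system-specific manner that this sketch does not model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    raw_fit: float               # raw fault rate in FIT (failures per 10^9 device-hours)
    vulnerability_factor: float  # assumed fraction of faults that are NOT masked, in [0, 1]


def effective_fit(component: Component) -> float:
    """Derate the raw fault rate by the vulnerability factor (first-order model)."""
    if not 0.0 <= component.vulnerability_factor <= 1.0:
        raise ValueError("vulnerability factor must lie in [0, 1]")
    return component.raw_fit * component.vulnerability_factor


def meets_target(components, fit_target: float):
    """Sum per-component effective FIT and compare the total with the target."""
    total = sum(effective_fit(c) for c in components)
    return total, total <= fit_target


# Hypothetical components and numbers, for illustration only.
components = [
    Component("register_file", raw_fit=200.0, vulnerability_factor=0.25),
    Component("issue_queue", raw_fit=100.0, vulnerability_factor=0.5),
    Component("reorder_buffer", raw_fit=400.0, vulnerability_factor=0.125),
]

total, ok = meets_target(components, fit_target=200.0)
print(f"total effective FIT = {total}, meets target: {ok}")
```

Under these assumed numbers each component contributes 50 FIT after derating, so the design would meet a 200 FIT target even though its raw fault rate is 700 FIT.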

The vulnerability stack can have immediate tangible benefits to the Computer Architecture community. First, by enabling independent vulnerability assessment of each system component, the vulnerability stack allows a designer to assess (and potentially improve) the fault-tolerance of a particular component (e.g., a user program). This enables a much broader segment of the Computer Architecture and Software Engineering communities to participate in the vulnerability assessment and remediation process; currently, these activities are typically performed by architects equipped with a microarchitectural model. A second benefit of the vulnerability stack is a substantial reduction in the overall effort required for vulnerability assessment. A third benefit of the vulnerability stack is its application to runtime vulnerability estimation techniques. These are of interest because they allow a system to dynamically tune redundancy features to match the current vulnerability environment; this can improve performance during periods of low vulnerability.

This project will impact undergraduate and graduate education by introducing vulnerability concepts in the Computer Architecture curriculum at Northeastern University and by delivering a tutorial at a major Computer Architecture conference. The project will also include participation by under-represented groups.

Project Report

Reliability in computing systems has become a first-class design constraint for major microprocessor designers and software developers. Current methods to measure system vulnerability treat a computer system as a monolithic entity. There may be significant opportunities to reduce the vulnerability of a system to faults if we are able to decompose vulnerability into multiple hardware and software components. The first step is to provide tools to allow assessment of vulnerability at different levels of abstraction. In this project we have focused on developing a System Vulnerability Stack that allows separate calculation of the vulnerability of individual system components. These components can then be combined in a system-specific manner to measure overall system vulnerability. This project pursued a novel approach to addressing system vulnerability by exploiting the fact that a system consists of multiple independent components (e.g., microarchitecture, virtual machine, user programs). The major results include: 1) a new understanding of the vulnerability of graphics processors, 2) a new compiler infrastructure that supports analysis of reliability, 3) a new methodology to reason about the impact of multi-bit errors, and 4) a better understanding of the interaction between hardware and software reliability. The project developed new analytical methods to reason about reliability across these traditional boundaries. The work has produced a number of important publications. The research team has interacted heavily with AMD Research on this project, which should help to ensure technology transfer out of this research. The project has engaged both doctoral students and undergraduate students working in reliability, compilation and simulation. The project has also impacted the design and implementation of the Multi2Sim simulator, the main tool used in the analysis of GPU reliability in this project.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1017439
Program Officer: Almadena Chtchelkanova
Budget Start: 2010-09-01
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $349,999
Name: Northeastern University
City: Boston
State: MA
Country: United States
Zip Code: 02115