For decades, computer system design and operation were driven largely by high-performance objectives. Yet, as the large-scale integration of semiconductor devices approaches its physical limits, energy efficiency and robustness have recently been promoted to first-class design constraints. Energy efficiency is mandated by the emergence of small-footprint, portable, battery-powered computers, as well as by ever-increasing power density, which puts stringent constraints even on computers connected to the power grid. Moreover, recent research has revealed that aggressive power management techniques can significantly increase the vulnerability of computer systems to transient faults (soft errors), which can cause incorrect operation at run-time. These problems are even more pronounced for real-time embedded systems, which must perform correctly at high reliability levels under strict timing and energy constraints.

In the recent past, a number of pioneering reliability-aware power management schemes have been proposed to mitigate the negative effects of the popular dynamic voltage and frequency scaling technique. This project addresses the conservatism of the existing solutions and develops a more general framework. Specifically, the project is devising novel solutions that achieve arbitrary reliability levels through the use of shared recovery tasks. In addition, the research extends the framework to multiprocessor and emerging multicore platforms. The project has two major broader impact dimensions. First, energy awareness has a direct impact on the environment, the economy, and society at large. Second, by promoting reliability to a first-order objective, the project will help prevent malfunctions in safety-critical computer systems and protect property and human lives.

Project Report

Real-time embedded computer systems have to operate under strict timing constraints. With ever-increasing power density, the energy efficiency of those systems has become an important dimension. In that regard, Dynamic Voltage Scaling (DVS), which operates the processor at a lower voltage and frequency, is a popular technique that trades computation speed for energy savings. In addition, faults that affect those systems at run-time may cause incorrect computations and/or unavailability of the computing units. Detecting these faults and executing the necessary recovery operations before the specified deadlines is critical to guarantee correct execution, and often to prevent system malfunctions that might lead to the loss of property or human lives in safety-critical systems.

In this project, we have extended the theory and practice of energy-efficient and fault-tolerant real-time computing in several directions. We have developed solutions in which multiple tasks on the computer may share the same recovery time as needed, improving the efficiency of the existing reliability-aware power management (RA-PM) solutions based on DVS. For multiprocessor real-time systems, we extended the techniques based on global scheduling to incorporate recovery tasks (and hence reliability) while still accounting for energy efficiency. We have investigated and analyzed how arbitrary reliability targets can be achieved on multiprocessor systems with minimum energy consumption, by replicating tasks on multiple processors as necessary. We also conducted an in-depth analysis of dual-processor energy-efficient standby-sparing systems: the solutions we developed address both fixed-priority and dynamic-priority real-time systems, and include both offline and online solutions. In addition, we have analyzed how standby-sparing systems could be extended to more than two processors with minimum energy consumption. We also proposed and developed the preference-oriented scheduling framework for multiprocessor systems, which allows the primary and backup tasks to execute on the same processor in a mixed manner. Finally, we examined and analyzed settings where transient faults may occur in bursts and hence may affect multiple tasks at once. For those settings, we derived the conditions that must be satisfied to still guarantee the timing constraints when the length of the fault burst does not exceed a given bound.

By incorporating reliability as a first-class operational and design objective, the solutions developed by the project have the potential to prevent malfunctions in safety-critical real-time computer systems and to protect property and human lives. Moreover, the solutions are specifically designed to reach those objectives with minimum energy consumption. Overall, the project is likely to have a positive societal impact by contributing simultaneously to safety, the environment, and the economy.
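
The shared recovery idea mentioned in the report can be illustrated with a small sketch. It is not the project's algorithm, only a simplified illustration under several assumptions that do not come from the report: all tasks share one common deadline, frequencies are normalized so that full speed is 1.0, a single recovery block sized for the longest task (run at full speed) is reserved for the whole task set, and dynamic power grows with the cube of the frequency. The function names `shared_recovery_speed` and `dynamic_energy` are hypothetical.

```python
def shared_recovery_speed(wcets, deadline, f_min=0.1):
    """Return the lowest common normalized frequency (assumed model) at which
    all tasks finish while one shared recovery block still fits before the
    deadline, or None if even full speed leaves no room for recovery.

    wcets: worst-case execution times at full speed (frequency 1.0).
    """
    total = sum(wcets)
    recovery = max(wcets)          # shared block covers the longest task at full speed
    budget = deadline - recovery   # time left for the scaled primary executions
    if budget < total:             # would require a frequency above 1.0
        return None
    return max(f_min, total / budget)


def dynamic_energy(wcets, f):
    """Normalized dynamic energy, assuming power ~ f**3 and run time c / f."""
    return sum(c * f ** 2 for c in wcets)


if __name__ == "__main__":
    tasks = [2.0, 3.0, 5.0]
    f = shared_recovery_speed(tasks, deadline=20.0)
    print("scaled frequency:", f)
    print("energy at scaled frequency:", dynamic_energy(tasks, f))
    print("energy at full speed:", dynamic_energy(tasks, 1.0))
```

Under these assumptions, the time not reserved for the single shared recovery block is used to slow all tasks down uniformly, which is where the energy saving comes from. Reserving a separate recovery block for every task would leave less of that time and force a higher frequency.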

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Application #
1016855
Program Officer
M. Mimi McClure
Project Start
Project End
Budget Start
2010-08-01
Budget End
2014-07-31
Support Year
Fiscal Year
2010
Total Cost
$268,329
Indirect Cost
Name
George Mason University
Department
Type
DUNS #
City
Fairfax
State
VA
Country
United States
Zip Code
22030