As our reliance on IT continues to increase, future applications will involve the processing of massive amounts of data and will require an exascale computing infrastructure in which the number of computing, communication and storage elements increases by several orders of magnitude. Such an infrastructure will inevitably incorporate new classes of high-density, low-latency and low-power non-volatile memory. This, in turn, will increase the rate of failures by orders of magnitude, making resiliency a major concern.

This project addresses this resiliency challenge by taking a radical approach to fault tolerance that goes beyond the current practice of checkpointing and rollback recovery. It introduces innovative and scalable fault-tolerance mechanisms, namely shadow computing and quality-of-data (QoD) aware replication, as building blocks of a "tunable" resiliency framework that leverages new and emerging memory technologies and takes into consideration the nature of the data and the requirements of the underlying application.

It is expected that the project will lead to new insights into the multi-faceted and challenging resiliency problem in exascale computing platforms. The expected outcomes of the project are a new fault-tolerance computational model and a suite of QoD-aware replication methods that, when combined with storage-level resiliency, will lead to high availability with minimal access delay in exascale computing environments.

The project seeks to involve graduate and undergraduate students in all of its research thrusts. In addition to contributing to the research activities, participating students take part fully in outreach, dissemination and community-engagement efforts. The project also seeks to leverage existing collaborations with industrial partners to place students in summer internships, providing them with first-hand exposure to research and development in an industrial setting. A main objective of the recruiting effort is to involve students from minorities and other under-represented groups.

Project Report

A new resilience mechanism is proposed for both High Performance Computing (HPC) and Cloud Computing environments. Maximizing throughput is the main objective of the former, while satisfying Service Level Agreements (SLAs) is a critical aspect of the latter. As the demand for HPC and cloud computing accelerates, the underlying infrastructure is expected to ensure performance, reliability and cost-effectiveness, even with a multifold increase in the number of computing, storage and communication components. Current resilience approaches rely on either time or hardware redundancy to tolerate failure. The first approach, which rolls back the computation and re-executes after a failure, is subject to significant delay. The second approach exploits hardware redundancy and executes multiple instances of the same task in parallel to overcome failure and guarantee that at least one replica reaches completion. This solution, however, increases the energy consumed to provide a given service, which in turn might outweigh the profit gained by providing it. The trade-off between performance, fault tolerance and power consumption calls for new frameworks that are energy- and performance-aware when dealing with failures.

To this end, we introduce Shadow Computing to address the above trade-off challenge. The basic tenet of Shadow Computing is to associate with each main process a suite of "shadows" whose sizes depend on the "criticality" of the application and its performance requirements. Like traditional process replication, Shadow Computing ensures successful task completion by concurrently running multiple instances (processes) of the same task. Contrary to the traditional approach, however, Shadow Computing executes the main process of the task at the speed required to maximize profit and slows down the execution of the shadow processes to save energy. Slowing down the shadows can be achieved either by reducing the voltage/frequency of the processor or by co-locating multiple shadows on the same processor. Adjusting the speed of execution enables a parameterized trade-off between response time, energy consumption and hardware redundancy. This allows for the optimization of throughput in the case of HPC, and the maximization of expected profit, accounting for income, potential penalties and energy cost, in the case of Cloud Computing.

Results from a sensitivity study using both analytical models and simulation show that Shadow Computing can achieve significant energy savings (up to 30%) and profit gains (up to 19%) compared to traditional process replication and checkpointing/restart, without violating the resilience or SLA constraints.
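To make the energy trade-off concrete, the following Python sketch compares traditional two-way replication against a main process paired with one slowed shadow. It is purely illustrative and not the project's actual model: it assumes a conventional cubic dynamic-power model (power proportional to frequency cubed), a normalized task size, and made-up parameter values (sigma_shadow, fail_prob), so the resulting numbers will not reproduce the 30% and 19% figures reported above.

    # Illustrative sketch of the Shadow Computing energy trade-off.
    # Assumptions (not from the project): dynamic power ~ frequency^3,
    # task size normalized to 1, one main process plus one shadow.

    def dynamic_power(freq):
        """Dynamic power under DVFS, normalized so full-speed power is 1."""
        return freq ** 3

    def energy(freq, work):
        """Energy = power x time, with time = work / freq (so energy ~ work * freq^2)."""
        return dynamic_power(freq) * (work / freq)

    def traditional_replication_energy(work=1.0):
        """Two full-speed replicas run to completion regardless of failures."""
        return 2 * energy(1.0, work)

    def shadow_computing_energy(work=1.0, sigma_shadow=0.5, fail_prob=0.05):
        """Expected energy: main at full speed, shadow slowed to sigma_shadow.
        If the main fails (probability fail_prob), the shadow finishes the
        remaining work at full speed; otherwise it is terminated early."""
        main = energy(1.0, work)
        # While the main runs (time = work), the shadow completes sigma_shadow * work.
        shadow_during_main = energy(sigma_shadow, sigma_shadow * work)
        remaining = work - sigma_shadow * work
        # With probability fail_prob the shadow must catch up at full speed.
        shadow_after_failure = fail_prob * energy(1.0, remaining)
        return main + shadow_during_main + shadow_after_failure

    if __name__ == "__main__":
        base = traditional_replication_energy()
        shadow = shadow_computing_energy()
        print(f"traditional replication energy: {base:.3f}")
        print(f"shadow computing energy:        {shadow:.3f}")
        print(f"savings: {100 * (1 - shadow / base):.1f}%")

Under these assumed parameters the shadow contributes only a small fraction of a full replica's energy while idle main processes are rare, which is the intuition behind slowing shadows down rather than running full-speed replicas.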

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1252306
Program Officer: Marilyn McClure
Project Start:
Project End:
Budget Start: 2013-01-01
Budget End: 2015-12-31
Support Year:
Fiscal Year: 2012
Total Cost: $299,800
Indirect Cost:
Name: University of Pittsburgh
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15260