A daunting challenge is the evolution from today's computational Grid to a true cyberinfrastructure that seamlessly integrates resources ranging from small clusters in academic laboratories to the largest national supercomputing centers and provides ubiquitous access to high performance computing, research instrumentation, data warehouses and visualization. Realization of this future requires fundamental advances in transparent fault recovery mechanisms to mask component failures endemic to any large-scale computational resource. While previous generations of supercomputers engineered reliability into systems hardware, today's high performance computing (HPC) environments are based on clusters of COTS components, with no systemic solution for the reliability of the resource as a whole. Engendering stability in ever growing networked collections of cluster systems needs a software solution that provides reliable access to computing resources through transparent, efficient, and automatic checkpointing and recovery (CPR) mechanisms. This project aims to bring about this future through radically new approaches to longstanding problems in CPR and process migration by building an integrated system called Deja vu. Deja vu provides (a) a transparent parallel checkpointing and recovery mechanism that recovers from any combination of systems failures without any modification to parallel applications. (b) a novel post-compiler analysis system that transparently captures application state, (c) a systems architecture that seamlessly integrates user-initiated and system-initiated checkpoints in a single framework enabling the effective use of a wide variety of domain specific knowledge, (d) novel runtime mechanisms for transparent incremental checkpointing, to efficiently capture the least amount of state required to maintain global consistency, (e) a novel communications architecture that enables transparent migration of existing MPI/PVM codes without source-code modifications to either the application or the MPI/PVM libraries, (f) recoverable IO subsystems that can be tailored to specific storage environments, and (g) interfaces to and augmentation of the Globus Toolkit to effectively use the CPR and migration capabilities provided by this research. The core CPR and migration facilities of Deja vu will be surrounded by management, security, and scheduling facilities that (a) integrate with local scheduling systems (e.g., OpenPBS) and accounting systems for site-specific accounting and refunding of lost compute cycles and (b) extend the Globus security architecture with fine grain rights and dynamically created user accounts that allow the fluid resource control available under the Deja vu system to be fully exploited. The design goal of this project is not just to implement "point" solutions, but an integrated system that will constitute a fundamental component of both large-scale computing facilities and Grid infrastructures. Our research team (VT, PSC, ISR) has considerable experience in the design, development, deployment and support of complete solutions.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Application #
0325540
Program Officer
Anita J. LaSalle
Project Start
Project End
Budget Start
2004-04-15
Budget End
2009-03-31
Support Year
Fiscal Year
2003
Total Cost
$86,750
Indirect Cost
Name
West Virginia High Technology Consortium Foundation
Department
Type
DUNS #
City
Fairmont
State
WV
Country
United States
Zip Code
26554