ITR/NGS: Deja Vu: Transparent Checkpointing and Migration of Parallel Codes Over Grid Infrastructures

Bernal, Fredric; Webb, Sidney; LaMar, Eric

Abstract

A daunting challenge is the evolution from today's computational Grid to a true cyberinfrastructure that seamlessly integrates resources ranging from small clusters in academic laboratories to the largest national supercomputing centers and provides ubiquitous access to high performance computing, research instrumentation, data warehouses and visualization. Realization of this future requires fundamental advances in transparent fault recovery mechanisms to mask component failures endemic to any large-scale computational resource. While previous generations of supercomputers engineered reliability into systems hardware, today's high performance computing (HPC) environments are based on clusters of COTS components, with no systemic solution for the reliability of the resource as a whole. Engendering stability in ever growing networked collections of cluster systems needs a software solution that provides reliable access to computing resources through transparent, efficient, and automatic checkpointing and recovery (CPR) mechanisms. This project aims to bring about this future through radically new approaches to longstanding problems in CPR and process migration by building an integrated system called Deja vu. Deja vu provides (a) a transparent parallel checkpointing and recovery mechanism that recovers from any combination of systems failures without any modification to parallel applications. (b) a novel post-compiler analysis system that transparently captures application state, (c) a systems architecture that seamlessly integrates user-initiated and system-initiated checkpoints in a single framework enabling the effective use of a wide variety of domain specific knowledge, (d) novel runtime mechanisms for transparent incremental checkpointing, to efficiently capture the least amount of state required to maintain global consistency, (e) a novel communications architecture that enables transparent migration of existing MPI/PVM codes without source-code modifications to either the application or the MPI/PVM libraries, (f) recoverable IO subsystems that can be tailored to specific storage environments, and (g) interfaces to and augmentation of the Globus Toolkit to effectively use the CPR and migration capabilities provided by this research. The core CPR and migration facilities of Deja vu will be surrounded by management, security, and scheduling facilities that (a) integrate with local scheduling systems (e.g., OpenPBS) and accounting systems for site-specific accounting and refunding of lost compute cycles and (b) extend the Globus security architecture with fine grain rights and dynamically created user accounts that allow the fluid resource control available under the Deja vu system to be fully exploited. The design goal of this project is not just to implement "point" solutions, but an integrated system that will constitute a fundamental component of both large-scale computing facilities and Grid infrastructures. Our research team (VT, PSC, ISR) has considerable experience in the design, development, deployment and support of complete solutions.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0325540
Program Officer: Anita J. LaSalle

Project Start
Project End
Budget Start: 2004-04-15
Budget End: 2009-03-31
Support Year
Fiscal Year: 2003
Total Cost: $86,750
Indirect Cost

ITR/NGS: Deja Vu: Transparent Checkpointing and Migration of Parallel Codes Over Grid Infrastructures
Bernal, Fredric Webb, Sidney LaMar, Eric
West Virginia High Technology Consortium Foundation, Fairmont, WV, United States

Abstract

Funding Agency

Institution

Comments

Recent in Grantomics:

Recently viewed grants:

Recently added grants:

Abstract

Funding Agency

Institution

Comments