The thrust of this project is to extend the underlying theory and to enhance, refine, test, and write applications for the Octopus System, developed by the PI during 1994-96. Octopus is a novel software layer that harnesses the computing power of clustered workstations or PC boards to produce a fault-tolerant, scalable system providing high throughput and fast turnaround for large parallel computations. Octopus is based on, and realizes, previous and current fundamental work, and is a prime example of theory translated into an actual system. The project continues the fundamental work side by side with the system-building work, to the mutual benefit of both, and investigates further applications of randomization.

Over the past four years the PI, in collaboration with others, has developed a theory of asynchronous parallel computation. The Asynchronous Parallel System (APS) consists of a number of processors executing at possibly different rates and addressing a logically (but not necessarily physically) shared memory. The theory showed how to efficiently simulate, on realistic asynchronous parallel systems whose processors may also fail, the execution of parallel programs written for fault-free synchronous parallel computers. The most efficient simulations were developed for large-grained n-thread parallel programs, in which most threads execute a substantial block of instructions within each parallel step and the program variables comprise a large number of memory words (for example, a row of a large matrix, or a string of keys in a parallel merge-sort).

A cluster of workstations or PC boards connected by a high-bandwidth switch is one realization of the APS model; the asynchrony arises from the absence of a common driving clock and from the fact that each node may be multi-programmed. The Octopus System consists of such a cluster managed by the Octopus layer. The system is up and running and realizes the control, load-distribution, and fault-tolerance properties described above.
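
This summary does not spell out Octopus's actual protocol, so the following is only a minimal illustrative sketch of the simulation idea it describes: asynchronous, fail-prone workers jointly execute the large-grained thread blocks of each synchronous parallel step, and because each block is idempotent, work abandoned by a slow or crashed worker can safely be redone by another. The function simulate, the staggered-scan claiming rule, and the use of Python threads as stand-ins for cluster nodes are assumptions of the sketch, not the system's real mechanisms.

    import threading

    def simulate(program_steps, n_workers):
        """Run a synchronous multi-thread program on asynchronous workers.

        program_steps: a list of parallel steps; each step is a list of
        idempotent, zero-argument "thread blocks".  A step's barrier
        closes only when every block has been executed by some worker,
        so work abandoned by a slow or crashed worker is picked up,
        perhaps redundantly, by the others.
        """
        for blocks in program_steps:
            done = [False] * len(blocks)      # shared progress map for this step

            def worker(wid, blocks=blocks, done=done):
                n = len(blocks)
                while not all(done):          # keep helping until the step closes
                    for k in range(n):
                        i = (wid + k) % n     # staggered scan spreads workers out
                        if not done[i]:
                            blocks[i]()       # duplicate runs are harmless because
                            done[i] = True    # each block writes step-local output

            ts = [threading.Thread(target=worker, args=(w,))
                  for w in range(n_workers)]
            for t in ts:
                t.start()
            for t in ts:
                t.join()                      # barrier between parallel steps

    # Example: one parallel step computing y = A x, one row per thread block
    # (the summary's "row of a large matrix" granularity).
    A = [[1.0] * 4 for _ in range(6)]
    x = [2.0, 0.5, 1.0, 3.0]
    y = [0.0] * len(A)

    def make_block(r):
        def block():                          # idempotent: overwrites y[r]
            y[r] = sum(a * b for a, b in zip(A[r], x))
        return block

    simulate([[make_block(r) for r in range(len(A))]], n_workers=3)
    print(y)    # every entry equals sum(x) = 6.5

The point the sketch tries to capture is that a step's barrier depends only on every block having been completed by someone, not on every worker surviving; redundant execution of a few blocks is the price paid for tolerating slow or failed processors.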