The thrust of this project is to extend the underlying theory of the Octopus System, and to enhance, perfect, test, and write applications for it. Octopus was developed by the PI during 1994-96. It is a novel software layer that harnesses the computing power of clustered workstations or PC boards to produce a fault-tolerant, scalable system providing high throughput and fast turnaround for large parallel computations. Octopus is based on, and realizes, previous and current fundamental work, and is a prime example of theory translated into a working system. The project continues the fundamental work side by side with the system-building work, to the mutual benefit of both, and investigates further applications of randomization.

Over the past four years the PI has developed, in collaboration with others, a theory of asynchronous parallel computations. The Asynchronous Parallel System (APS) consists of a number of processors executing at possibly different rates and addressing a logically (but not necessarily physically) shared memory. The theory showed how to efficiently simulate the execution of parallel programs, written for fault-free synchronous parallel computers, on realistic asynchronous parallel systems in which processors may also fail. The most efficient simulations were developed for large-grained n-thread parallel programs. In such a program most threads execute a substantial block of instructions within each parallel step, and each program variable comprises a large number of memory words (for example, a row of a large matrix or a string of keys in a parallel merge-sort).

A cluster of workstations or PC boards connected by a high-bandwidth switch is one realization of the APS model. The asynchrony arises from the absence of a common driving clock and from the fact that each node may be multi-programmed. The Octopus System consists of such a cluster managed by the Octopus layer. The system is up and running and realizes the control, load distribution, and fault-tolerance properties of Octopus.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 9700365
Program Officer: Yechezkel Zalcstein
Budget Start: 1997-06-15
Budget End: 1999-05-31
Fiscal Year: 1997
Total Cost: $211,646

Name: Harvard University
City: Cambridge
State: MA
Country: United States
Zip Code: 02138