With the increasing availability of message-based distributed parallel computing systems, the difficulties of constructing parallel algorithms and mapping them onto parallel architectures have grown in importance. Currently, algorithm development and programming typically include the explicit mapping of algorithm subtasks to processors. This mapping is usually done manually and often yields poor balancing of computation and communication loads, which in turn reduces the speedups that could otherwise be obtained. The problem is exacerbated in distributed architectures where processors and/or communication links are subject to failure, yet the system must continue processing at reasonable rates with the remaining computing resources. This requires that process and data checkpointing be done effectively and that, once failures are detected and located, the system be reconfigured and tasks reallocated. This research program focuses on the checkpointing and task reallocation problems. It investigates approaches to allocating tasks to processors in distributed computer systems whose reliability properties have been characterized. The goal is to ensure effective use of the remaining distributed resources after a processor or link failure through the development of distributed checkpointing and fast task reallocation schemes.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 9021041
Program Officer: Yechezkel Zalcstein
Project Start:
Project End:
Budget Start: 1991-05-15
Budget End: 1995-04-30
Support Year:
Fiscal Year: 1990
Total Cost: $433,467
Indirect Cost:
Name: Washington University
Department:
Type:
DUNS #:
City: Saint Louis
State: MO
Country: United States
Zip Code: 63130