As large-scale, distributed systems are used in safety-critical applications, the fault tolerance of these systems becomes a crucial issue. This research focuses on the fundamental problems of diagnosis and fault-tolerant routing. This work will, for the first time, consider the coordination of these important tasks, by making use of diagnostic information to increase routing efficiency and system throughput. The work will target systems possessing a relatively sparse interconnection structure such as grids, meshes, and hypercubes which constitute a large fraction of existing systems and are expected to dominate future applications. The proposed research will involve the integrated design of algorithms for diagnosis and routing, along with evaluations of algorithm performance that are both theoretical and experimental in nature. The theoretical work will utilize probabilistic analysis techniques in order to evaluate performance under recently proposed probabilistic models as well as to examine performance limits for these systems. The experimental work will involve the implementation and testing of the proposed algorithms on 32-node Intel IPSC/2 hypercube. A product of this research should be efficient algorithms for these fundamental problems in distributed fault tolerance that can be used in a wide range of multiprocessor/multicomputer systems.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Application #
9010547
Program Officer
Yechezkel Zalcstein
Project Start
Project End
Budget Start
1990-08-15
Budget End
1993-07-31
Support Year
Fiscal Year
1990
Total Cost
$53,590
Indirect Cost
Name
University of California Irvine
Department
Type
DUNS #
City
Irvine
State
CA
Country
United States
Zip Code
92697