Two recent developments have dramatically increased the importance of ultra-reliable parallel computing. First, parallel processing has moved from a specialized tool for scientific computation to an instrument for general-purpose computing. Second, computers have begun to control virtually every aspect of human life and are therefore increasingly relied upon to provide continuous service. These trends make the development of general but practical techniques for fault-tolerant multiprocessing a necessity. This project focuses on loosely-coupled parallel computers. These systems are well suited to ultra-reliable applications because their built-in hardware redundancy can be used to achieve fault tolerance. The project has both basic and applied research components. The basic research component studies new models and algorithms for two problems: multiprocessor system fault diagnosis and fault-tolerant routing. The applied research component involves the development of an experimental testbed for fault-tolerant multicomputer systems. A Transputer-based MIMD multicomputer system provides the testbed hardware and low-level operating system. Special-purpose software is developed to provide a system-level fault tolerance framework, a fault simulator, and a data collection tool for the testbed. The testbed allows experimental evaluation of the system-level fault tolerance mechanisms developed in the basic research component and is also made available to other research groups working on multiprocessor system fault tolerance.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 9318495
Program Officer: Yechezkel Zalcstein
Budget Start: 1994-07-01
Budget End: 1998-06-30
Fiscal Year: 1993
Total Cost: $207,487

Name: University of California Irvine
City: Irvine
State: CA
Country: United States
Zip Code: 92697