As society increasingly relies on distributed applications, it is ever more important that these applications tolerate partial failures. However, many users and developers are unwilling to bear the resource and performance costs that building reliable applications incurs today. This research seeks to resolve the tension between the need for fault-tolerance and the costs traditionally associated with it. A new low-overhead approach to fault-tolerance, called lightweight fault-tolerance (LiFT), is investigated. The ideas at the core of LiFT are (1) to determine the information necessary to reproduce, during recovery, each non-deterministic event in a process execution, and (2) to guarantee that this information is available during recovery by efficiently replicating it in the volatile memory of a sufficient number of processes. LiFT has two main goals: (1) the development of low-overhead fault-tolerance techniques that encompass applications in which communication is through message passing, shared memory, or any combination of the two; and (2) the development of recovery protocols that simultaneously guarantee good performance during failure-free executions, fast recovery, and fault containment.
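
The two core ideas can be illustrated with a toy model. The sketch below is not the LiFT protocols themselves, which the abstract does not specify. It assumes that the only non-deterministic events are message deliveries, that a process's state is a deterministic function of its delivery order, and that tolerating f crashes requires f + 1 volatile copies of each event's description. The names Determinant, Process, deliver, and recover are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Information needed to replay one non-deterministic event
    (assumed here to be a single message delivery)."""
    receiver: str
    seq: int       # position of this delivery in the receiver's order
    sender: str
    payload: str


@dataclass
class Process:
    name: str
    # State is modeled as the ordered list of deliveries (an assumption).
    state: list = field(default_factory=list)
    # Volatile memory: determinants held on behalf of any process.
    # Everything in this dict is lost if this process crashes.
    log: dict = field(default_factory=dict)

    def deliver(self, sender, payload, replicas):
        """Record the determinant at every replica, then apply the delivery."""
        det = Determinant(self.name, len(self.state), sender, payload)
        for holder in replicas:
            holder.log[(det.receiver, det.seq)] = det
        self.state.append((sender, payload))
        return det


def recover(name, survivors):
    """Rebuild a crashed process by replaying determinants found in the
    volatile logs of the surviving processes, in delivery order."""
    found = {}
    for proc in survivors:
        for (receiver, seq), det in proc.log.items():
            if receiver == name:
                found[seq] = det
    fresh = Process(name)
    for seq in sorted(found):
        det = found[seq]
        fresh.state.append((det.sender, det.payload))
    return fresh


if __name__ == "__main__":
    # Tolerate f = 1 crash: keep f + 1 = 2 copies of each determinant.
    p, q, r = Process("p"), Process("q"), Process("r")
    p.deliver("q", "a", replicas=[p, q])
    p.deliver("r", "b", replicas=[p, q])
    before_crash = list(p.state)
    # p crashes and its volatile memory is gone; q and r survive.
    recovered = recover("p", survivors=[q, r])
    print(recovered.state == before_crash)
```

In this toy model, recovery succeeds only if at least one process holding each determinant survives, which is why the number of copies must exceed the number of crashes to be tolerated. Nothing is written to stable storage, which reflects the abstract's emphasis on replication in volatile memory.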

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 9734185
Program Officer: Yuan-Chieh Chow
Project Start:
Project End:
Budget Start: 1998-06-15
Budget End: 2003-03-31
Support Year:
Fiscal Year: 1997
Total Cost: $200,000
Indirect Cost:
Name: University of Texas Austin
Department:
Type:
DUNS #:
City: Austin
State: TX
Country: United States
Zip Code: 78712