The starting point for this research is a fundamental problem in fault-tolerant distributed computing: reaching agreement among processes in a system that is subject to failures. It is well-known that this problem, called Consensus, has no deterministic solution in asynchronous systems, even if it is assumed that communication is reliable and no more than one process may fail. The impossibility of solving Consensus ( and other related problems such as Atomic Broadcast) is one of the most severe obstacles to implementing fault-tolerant applications in asynchronous systems. In recent work the PI has introduced a novel approach to circumvent such impossibility results: he showed that unreliable failure detectors can be used to solve Consensus (and Atomic Broadcast), even if the information that they provide about failures is highly inaccurate, e.g., even if they make an infinite number of mistakes. Since such failure detectors can be implemented in realistic distributed systems, and since various considerations make the asynchronous models especially attractive, this work suggests an approach to fault-tolerance that is viable in practice. The objectives of this research are to broaden the applicability of this approach by removing the limitations of the earlier work, and to explore in more concrete terms its practicability. Specific goals include: (1) tolerating communication failures, including network partitions (the earlier work assumed reliable links); (2) tolerating process failures of various types (the earlier work dealt with crash failures only); (3) considering shared-memory systems (the earlier work dealt with message-passing systems); and (4) solving other problems that are central to fault-tolerant distributed computing, including Group Membership and Group Multicasts (the earlier work solved Consensus and Atomic Broadcast). In order to assess the cost and benefit of using unreliable failure detectors complexity questions are also explored. Finally, the practicality of this approach is validated by implementation on an experimental platform.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Application #
9402896
Program Officer
Anand R. Tripathi
Project Start
Project End
Budget Start
1995-05-01
Budget End
1998-10-31
Support Year
Fiscal Year
1994
Total Cost
$230,000
Indirect Cost
Name
Cornell University
Department
Type
DUNS #
City
Ithaca
State
NY
Country
United States
Zip Code
14850