Today's distributed systems are fragile and easily broken. As a result, total cost of ownership is no longer dominated by capital costs. Commodity operating systems, middleware, and other software building blocks are being used to build even critical applications in areas such as finance and banking, yet the complexity of the resulting systems is often beyond our understanding.

We propose to build on our earlier, successful argument that system failures are not problems to be decisively solved but ongoing facts of life to be dealt with. Our approach therefore centers on systematic, automatic detection of and recovery from many kinds of failures, performed so quickly that failure and recovery become a form of adaptation. This lets us leverage the well-tested ideas of control theory and yields a new basis for the design of dependable distributed computing systems. We call the approach RADS: Reliable Adaptive Distributed Systems. We will develop RADS design guidelines and prototypes for creating controllable systems by leveraging existing techniques from Statistical Learning Theory (SLT) and Control Theory (CT). This will enable much wider applicability of SLT and CT to dependable computing and establish a concrete venue for collaboration with those research communities. Although CT and, to some degree, SLT have been applied in limited ways to monitoring and optimizing performance, to our knowledge ours is the first attempt to use these analytical tools to monitor and control for dependability and high availability.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0509559
Program Officer: Mohamed G. Gouda
Project Start:
Project End:
Budget Start: 2005-09-01
Budget End: 2009-08-31
Support Year:
Fiscal Year: 2005
Total Cost: $1,031,200
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94704