Today's distributed systems are fragile and easily broken. As a result, total cost of ownership is no longer dominated by capital costs. Commodity operating systems, middleware, and other software building blocks are now used to create even critical applications in areas such as finance and banking, yet the complexity of the resulting systems is often beyond our understanding.
We propose to build on our prior successful argument that system failures are not problems to be decisively solved, but ongoing facts of life to be dealt with. Our approach therefore centers on systematic, automatic detection of and recovery from many kinds of failures. Detection and recovery will be so fast that failure and recovery become a form of adaptation, which lets us leverage the well-tested ideas of control theory and yields a new basis for the design of dependable distributed computing systems. We call our approach RADS: Reliable Adaptive Distributed Systems. We will develop RADS design guidelines and prototypes for creating controllable systems by leveraging existing techniques from Statistical Learning Theory (SLT) and Control Theory (CT). This will make SLT and CT much more widely applicable to dependable computing and establish a concrete venue for collaboration with those research communities. Although CT and, to some degree, SLT have been applied in limited ways to monitor and optimize performance, to our knowledge ours is the first attempt to use these analytical tools to monitor and control for dependability and high availability.