Internet services are rapidly becoming an integral part of people's work and leisure all over the world. With such popularity and importance comes the need for 24x7 availability. Yet, recent research suggests that, at best, Internet services are achieving only 99-99.9% availability, which implies roughly 9 to 88 hours of downtime per year.
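The downtime range quoted above follows approximately from the availability percentages. A minimal sketch of that arithmetic, assuming a 365-day year (8,760 hours) and that all unavailability counts as downtime; the function name is illustrative only and does not come from the project:

```python
# Assumption: a non-leap year of 365 days, i.e. 8760 hours.
HOURS_PER_YEAR = 365 * 24


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a service is down at the given availability."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    # The two ends of the availability range cited in the text.
    for pct in (99.0, 99.9):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% availability -> about {hours:.1f} hours of downtime per year")
```

Under these assumptions, 99% availability corresponds to about 87.6 hours of downtime per year and 99.9% to about 8.8 hours.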
This project addresses a major source of unavailability: operator mistakes. Specifically, the PIs seek to reduce the impact of mistakes by guiding and validating operator actions in cluster-based Internet services. This approach involves three efforts: (1) to explore the nature of operator mistakes and their impact on the performance and availability of Internet services by interviewing and surveying experienced operators, running experiments with volunteer operators, and running several operator contests; (2) to develop operator models that can be used to guide operator actions based on the likelihood of mistakes; and (3) to design and prototype a validation infrastructure that is part of the online system, yet allows operators to check the correctness of their actions before those actions can affect the live service.
The main expected outcome of the project is a demonstration that systems that guide and validate operator actions can significantly reduce the impact of operator mistakes on service availability. Valuable artifacts for other research efforts relating to availability (and manageability) will include: (1) extensive data on operator mistakes and their impact on service performance and availability; (2) models of operator behavior for guiding actions; and (3) a prototype validation infrastructure.