Large-scale hosting infrastructures have become important platforms for many real-world systems such as cloud computing, enterprise data centers, massive data analytics, and web hosting services. Unfortunately, today's large-scale hosting infrastructures are still vulnerable to a range of system anomalies such as performance bottlenecks, resource hotspots, service level objective (SLO) violations, and software/hardware failures.
The goal of this project is to assess the viability of an online predictive anomaly management solution for large-scale hosting infrastructures. We will develop novel techniques for 1) performing lightweight online system anomaly prediction; 2) providing self-evolving anomaly prediction models to achieve high-quality prediction for real-world dynamic systems; and 3) performing speculative, "hot" system anomaly diagnosis that searches for possible anomaly causes and suggests corrective actions while the system approaches the anomaly state. We will evaluate these techniques through experiments and case studies conducted with our industrial partners on realistic platforms.
Students supported by this project will gain experience with the development and testing of robust real-world hosting infrastructures through interactions with our industrial partners, internships, and onsite experimentation. This work will advance diversity by involving students from under-represented groups. In particular, the prototype developed in this project will be applied to the Virtual Computing Lab (VCL) at NCSU, a platform for providing a better educational experience for K-12, community colleges, and universities.
This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).