Large-scale shared hosting infrastructures such as multi-tenant cloud computing systems have become increasingly popular by allowing users to lease resources on-demand in a cost-effective way. As multiple tenants may share computing resources, hosting infrastructures are complex systems and prone to various system anomalies. Although software developers often perform rigorous offline testing, many subtle bugs only manifest themselves during large-scale production run. Many anomalies such as those where the system does not crash but fails to behave as expected are hard to reproduce and diagnose using existing techniques. Existing system anomaly diagnosis work can be broadly classified into two categories: 1) the black-box schemes which do not require source code and are suitable for online production-site diagnosis, and 2) the white-box schemes which require source code and expensive code instrumentation and are suitable for development site, offline diagnosis. Although white-box schemes provide fine-grained diagnosis, large-scale production hosting infrastructures are reluctant to adopt them due to their high-overhead and intrusive system recording approaches.

The overarching objective of this project is to explore an innovative cross-site system anomaly debugging approach that intelligently integrates production-site black-box diagnosis with development-site white-box debugging into a more powerful hosting infrastructure debugging framework. This project will develop techniques for development-site, offline white-box debugging that takes the production-site fault inference results as guidance to find the exact anomaly causes. The project will focus on diagnosing non-crashing system anomalies (e.g., performance degradation, service outage, software hang, unexpected halt) that are common in real world hosting infrastructures but are difficult to debug using existing techniques.

Techniques developed in this project will generate significant impact on improving the robustness of real world hosting infrastructures. The PIs will develop new course modules on the hosting infrastructure debugging for both graduate and undergraduate classes they regularly teaches. This project will develop programming courseware based on the research prototypes developed in this project. The PIs will use their power of role model and a set of outreach activities to recruit more female students to pursue systems research. The PIs will disseminate their results and collected data broadly through publication and technology transfer. Developed software artifacts and experimental datasets will be released for public use.

National Science Foundation (NSF)
Division of Computer and Network Systems (CNS)
Application #
Program Officer
Marilyn McClure
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
North Carolina State University Raleigh
United States
Zip Code