This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
Because of the dynamic nature and unprecedented scale of the Internet, Internet services pose challenges, including scalability, reliability, and availability, to the underlying networked systems. This CAREER project concentrates on building Internet services that are resilient to those challenges by using machine learning and control techniques. Internet services are built on cluster-based computer systems that keep growing in scale and complexity. Such systems have become so complicated that even understanding the dynamic behavior of the entire system is a major challenge. The investigators take an analytical and organized approach to designing an autonomous software infrastructure on networked systems for building resilient Internet services. The project builds empirical models using statistical learning to help overcome the challenges of scale and complexity in networked systems. It designs coordinated admission control and capacity planning algorithms for end-to-end quality of service on multi-tier clusters. Model-independent control techniques are used together with the empirical models to allocate resources and to dynamically reconfigure the system for performance optimization. The project also develops performance differentiation, isolation, and self-adaptive reconfiguration capabilities to enhance system reliability and availability. It broadens the research impact by developing a testbed in a data center lab to demonstrate the orchestration of the designed techniques for automated arrangement, coordination, and management of complex computer systems, middleware, and services. The research results will be disseminated to the public as technical reports. This project also supports a new interdisciplinary Ph.D. of Engineering in security program.