This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
Computational clusters and cluster coalitions continue to grow in scale and in the complexity of their components and interactions. In these systems, component failures become the norm rather than the exception. Failure occurrence, along with its impact on system performance and operating costs, is an increasingly important concern for system designers and administrators. The success of petascale computing will depend on the ability to provide dependability at scale. Failure management and failure-aware resource management are crucial techniques for understanding emergent, system-wide phenomena and for self-managing resource burdens.
This project investigates a set of innovative techniques for failure-aware monitoring and management to assure system-level availability. In this project, we will develop a framework, along with mechanisms, for failure-aware autonomic resource management in large clusters; quantify the temporal and spatial correlations among failure occurrences to support proactive failure management; and devise resource allocation and reconfiguration approaches that address the availability and productivity problems caused by the frequent component failures in modern large, complex clusters.
Broader impacts of the project include the publication and dissemination of research results and developed software artifacts. The project creates collaborative research opportunities for students and faculty in the program, as well as for undergraduate science and engineering students in New Mexico. Research-based materials on dependable high-performance computing will also be incorporated into the undergraduate and graduate computer science and engineering curricula.