Parallel and distributed computing systems are often a heterogeneous mix of machines. As these systems continue to expand rapidly in capability, their computational energy expenditure has skyrocketed, requiring elaborate cooling facilities that themselves consume significant energy. The need for energy-efficient resource management is thus paramount. Moreover, these systems frequently experience degraded performance and high power consumption due to circumstances that change unpredictably, such as sudden machine failures or thermal hotspots caused by load imbalances. As the complexity of systems grows, so does the importance of making system operation robust against these uncertainties. The goal of this award is to study stochastic models, metrics, and algorithmic strategies for deriving resource allocations that are energy-efficient and robust. The research focus is on deriving stochastic robustness and energy models from real-world data from heterogeneous computing machines; applying stochastic models to resource management strategies that co-optimize performance, robustness, computation energy, and cooling energy; developing novel schemes for real-time thermal modeling; and driving and validating the research with feedback collected from real-world petascale systems (Yellowstone at the National Center for Atmospheric Research and Jaguar at Oak Ridge National Laboratory) and terascale systems (Colorado State University's ISTeC cluster and clusters at Oak Ridge National Laboratory).

The research is expected to realize resource management strategies that are resilient to various sources of uncertainty at run-time and that account for the dynamics of temperature variations and cooling capacity. These strategies are intended to meet performance guarantees with unprecedented gains in system energy efficiency in high-performance computing environments. By lowering the energy costs and the impact of uncertainties associated with computing, this research will ultimately render high-performance computing accessible to a wider population of researchers and scientific problems. In the long term, the theoretical foundations and tools that emerge from this research will play a vital role in achieving the grand promise of sustainable computing at extreme scales within realistic power budgets. The broader impacts of the research include incorporating research results into all levels of teaching, including graduate, undergraduate, secondary, and even elementary education; increasing participation by underrepresented groups; and fostering close ties with industry and government labs to transfer the developed knowledge quickly into real-world deployments.

Agency
National Science Foundation (NSF)
Institute
Division of Computing and Communication Foundations (CCF)
Application #
1302693
Program Officer
Almadena Chtchelkanova
Budget Start
2013-05-15
Budget End
2017-12-31
Fiscal Year
2013
Total Cost
$850,000
Name
Colorado State University-Fort Collins
City
Fort Collins
State
CO
Country
United States
Zip Code
80523