Peta-scale systems are needed to meet the computational demands of distributed simulations. The heat produced by the components in these systems continues to grow despite extensive research in power- and thermal-aware computing. Tight coupling of tens of thousands of these components produces enough heat to increase failure rates and trigger system slowdowns. Dissipating the heat requires costly, dedicated cooling systems. Techniques are needed to reduce the heat produced by large-scale systems.
We are building an infrastructure to enable automated thermal management in advanced execution systems. Specifically, we have two research goals: (1) create a framework for distributed runtime thermal profiling, and (2) create proactive control techniques that reduce the heat distributed applications and systems generate on individual high-end components. We are also integrating our research into new courses, recruiting minority students for research activities through the VT MAOP program, and conducting outreach activities to encourage student research in high-end computing.
We are creating technologies that will improve the efficiency and reliability of distributed simulations in general. All of our software tools and techniques will be open source and made available to the public through our website repository. This work benefits a broad range of disciplines that perform simulation-based experimentation, including computational physics, biology, and chemistry. Additionally, reducing the heat dissipation of large-scale applications will reduce operational costs for computational centers, increase system reliability, and indirectly benefit the environment through energy conservation.