Advanced computing systems --- that support a wide variety of applications in fields such as economics, sciences, and medicine --- are increasingly being designed with energy efficiency considerations. An extant approach to energy management is to run the underlying processors and devices at varying voltage and frequency. Typically, the approaches push the devices to run as fast as possible within thermal limits using the premise that "faster is better or at least does no harm." There is growing evidence in the prevailing literature that "slower is sometimes better." For example, for benchmark applications such as IOZone, it has been observed that running the processors at a faster speed can lead to significant slowdowns in the overall execution time. At large scale, e.g., in the Amazon Web Services cloud, such performance loss can cost hundreds of thousands of dollars in CPU hours and waste precious energy often begotten from polluting fossil fuels. However, isolating the root cause of such slowdowns in today's complex systems at the scale of data centers is akin to finding a needle in a haystack. Performance is now a function of the complex interaction between application design, system resources, and the underlying hardware. Furthermore, power scaling makes the raw performance of the hardware a variable; thus, further confounding attempts to isolate slowdowns.
This project builds novel technologies that identify, model and automate the minimization or elimination of slowdowns in parallel and distributed applications when power scaling is enabled. The key approach is fine-grain application and kernel instrumentation to develop in-depth analysis of the interaction between parallel and distributed applications and the software and hardware stack. The intellectual merit of this research involves three intermediate research goals: 1) Exhaustive testing and deep system and code analysis on a large class of applications and a diverse set of systems to classify and isolate the slowdown phenomenon due to power scaling; 2) Design, implementation, and validation of models of the critical paths of applications exhibiting sensitivity to slowdowns; and 3) Analysis of the resulting models and design, implementation, and validation of the automated, open-source, runtime optimization techniques to steer power scaling to minimize or eliminate slowdowns.
Completion of the project will improve the performance and energy efficiency of advanced systems. Adoption of the resulting runtime tools will enable use of power scaling to save energy while simultaneously reducing time-to-solution for modern applications and systems. The resulting artifacts and technologies will contribute to U.S. competitiveness by addressing the challenge of building large-scale systems within power constraints. The educational activities will help produce diverse graduates with highly marketable skill sets. The integration of the research discoveries and software tools, which will be open source and made public, into the educational curriculum will help capture the interest of the next generation of computer scientists.