In recent years, power and thermal control has become one of the most serious concerns for large-scale data centers, which are rapidly expanding the number of servers they host. In addition to reducing operating costs, precise control of power consumption and heat dissipation is essential for avoiding system failures caused by power capacity overload or by overheating due to increasingly high server density (e.g., blade servers). Power and thermal control becomes even more challenging as many data centers adopt virtualization technology for resource sharing, which increases utilization and power consumption.

This CAREER project addresses the following research topics. 1) We plan to design and evaluate advanced power and thermal control algorithms, based on feedback control theory, to achieve analytic assurance of control accuracy and system stability. First, we propose novel control algorithms at multiple layers to control power and application performance for virtualized server environments. Second, we propose highly scalable hierarchical algorithms to control the power consumption of an entire large-scale data center. Third, we will design cascaded control algorithms to control heat dissipation and handle thermal emergencies by coordinating with power control loops. 2) We propose power and thermal control middleware. Our middleware will find the optimal coordination strategy for multiple control loops to work together at different layers, and then configure them to achieve the desired control functions. In addition, our middleware can automate the procedure of controller design and analysis for a universal control solution. 3) We will also investigate other components such as hard drives and network switches, as well as the controllability and feasibility problems, in order to provide a complete power and thermal control framework for today's large-scale data centers.
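
To make the idea of feedback-based power control concrete, the following is a minimal sketch of a single power-capping loop for one server. It is not one of the project's algorithms. The linear power model, the integral controller, and all names and numeric values (simulate_power_capping, idle_w, gain_w_per_ghz, the frequency limits) are assumptions chosen only for illustration.

```python
def simulate_power_capping(budget_w, steps=50, idle_w=100.0,
                           gain_w_per_ghz=40.0, f_min=1.0, f_max=3.0):
    """Return the power trace of a toy server under integral feedback control.

    Assumed (hypothetical) plant model: power = idle_w + gain_w_per_ghz * freq.
    """
    # Hypothetical controller gain. The loop gain (gain_w_per_ghz * k_i) is 0.5,
    # so the power error halves every control period while the frequency
    # stays away from its limits. Any loop gain in (0, 2) keeps this toy
    # loop stable.
    k_i = 0.5 / gain_w_per_ghz
    freq = f_max  # start at full speed
    trace = []
    for _ in range(steps):
        # Measure power for the current control period.
        power = idle_w + gain_w_per_ghz * freq
        trace.append(power)
        # Positive error means headroom; negative error means over budget.
        error = budget_w - power
        # Integral update of the actuator (CPU frequency), clamped to its range.
        freq = min(f_max, max(f_min, freq + k_i * error))
    return trace


if __name__ == "__main__":
    trace = simulate_power_capping(budget_w=180.0)
    print([round(p, 1) for p in trace[:5]])
```

In this sketch the server starts above the budget and the controller lowers the frequency until the measured power settles at the budget. Because the closed-loop behavior follows from the model and the gain, the settling rate and stability range can be derived analytically, which is the kind of assurance that heuristic schemes do not provide.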

Project Report

On the technical side, the main outcome of this project is the design and development of a coordinated control framework for power, performance, thermal, and cost management in virtualized data centers. Specifically, the framework includes four major components: 1) power control and capping at different levels, 2) performance optimization, 3) thermal monitoring and management, and 4) reduction of capital and operating expenses. We now describe our technical outcomes in detail.

First, for power control and capping, we have designed several power control algorithms at three levels of a data center: the server, the server rack, and the entire data center. All of these algorithms were systematically designed based on optimal control theory for theoretically guaranteed control accuracy and system stability. They have been published in prestigious research conferences, such as PACT 2009, HPCA 2008, ICS 2011, and ICAC 2011. Their extended journal versions were published in IEEE Transactions on Parallel and Distributed Systems (TPDS) from 2010 to 2012.

Second, for performance optimization, we have systematically designed several algorithms based on recent advances in control theory. These algorithms were published in several major conferences, including RTSS 2008, IWQoS 2009, and IWQoS 2010. Their extended journal versions were published in IEEE TPDS from 2010 to 2012.

Third, for thermal monitoring and management, we have designed two intelligent and near-optimal temperature sensor placement algorithms that improve the detection of hot servers or server components, based on a systematic Computational Fluid Dynamics (CFD) analysis of the thermal conditions in the data center. These algorithms have been published in ICDCS 2011, IGCC 2012, IEEE TPDS in 2013, and the Elsevier journal Sustainable Computing: Informatics and Systems (SUSCOM) in 2013. For thermal management, we have designed two power optimization schemes that effectively coordinate liquid cooling, free air cooling, server placement, and dynamic workload allocation to jointly optimize cooling and server power. Those studies were published in ICAC 2014 and IGCC 2014.

Finally, for the reduction of capital and operating expenses, we have designed six novel algorithms that can significantly cut the capital expenses (CapEx) and operating expenses (OpEx) of data centers. These algorithms leverage different equipment, such as renewable energy supplies, thermal energy storage devices, plug-in hybrid electric vehicles (PHEVs), and portable containerized modules, and were published in Middleware 2011, ICPP 2012, CNSM 2012, IGCC 2013, HPCA 2014, and Performance 2014. Extended journal versions were published in SUSCOM in 2015 and accepted to the IEEE Transactions on Computers (TC).

A key difference between our framework and related work is that our power, performance, and thermal control and management solutions follow a rigorous system design methodology based on recent advances in feedback control theory, which provides analytical assurance of control accuracy and system stability. In addition, our thermal monitoring algorithms are designed based on a systematic CFD analysis of the thermal conditions in the data center. This theoretical foundation is in sharp contrast to current practice, which is designed based on oversimplified heuristics.

The broader impacts of this project are as follows. First, our power control framework has significantly improved data center performance while ensuring that power consumption stays safely below the desired power budget. For example, our work published in ICAC 2011 shows that our solution achieves 38% better server performance, on average, than state-of-the-art solutions. Likewise, the extensive evaluation results in our ICS 2011 paper show a 23% performance improvement. These timely designs can significantly help data center operators achieve better performance with lower power consumption. In addition, our work on thermal monitoring and management can allow data centers to use 26-38% less cooling energy while still detecting potential hot spots in a timely manner to avoid undesired server shutdowns.

Second, on the education side, we have successfully integrated our research results into several undergraduate and graduate courses taught by the PI, such as ECE 5362 (Computer Architecture and Design), ECE 8862 (Special Topics in Advanced Computer Design Methodologies), and ECE 655 (Power-aware Systems). The research in this project has provided a rich set of examples, design tools, and project opportunities for these courses. The state-of-the-art course materials were well received by our students and shared with other universities interested in teaching similar courses. We have also supported four Ph.D. students writing dissertations on this topic. Two of them have received their Ph.D. degrees and joined Facebook Inc. and Siemens Healthcare as research engineers. They are now applying what they learned from this project to real-world industry products, such as Facebook’s data centers.

Finally, for result dissemination, we have presented our results at many conferences, workshops, and seminars. We have also released all the software artifacts developed in this project (including control algorithms, simulators, middleware, and related tools) under an open-source model.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 1143607
Program Officer: Marilyn McClure
Budget Start: 2011-08-15
Budget End: 2015-01-31
Fiscal Year: 2011
Total Cost: $395,219
Name: Ohio State University
City: Columbus
State: OH
Country: United States
Zip Code: 43210