This project investigates optimization problems that arise in the thermal management of very large data storage centers. To satisfy growing data management needs, such storage centers may contain hundreds of thousands of hard disks and other components, and they are typically active around the clock. These components generate a great deal of heat, so the storage system must be cooled to maintain reliability, resulting in significant cooling costs. The cooling mechanism and the workload assignments in a storage center are intricately tied together. This project is developing a general science of thermal management for large-scale storage systems by focusing on thermal modeling and management at different levels of the system hierarchy. It is developing thermal-aware techniques for allocating data access tasks to the specific disks on which the data is located, for controlling the schedules and speeds of thousands of tasks and disks to optimize quality of service, and for reorganizing data layouts on disks. The project will enable better thermal management in data storage centers, which can potentially result in significant reductions in their carbon footprint. It will also train several Ph.D. students in conducting research, both at the university and through internships at industrial research labs.
The goal of this NSF project was to develop tools and techniques for solving optimization problems that arise while performing thermal management in very large data storage centers. Future storage centers are envisioned to contain hundreds of thousands of hard drives and other components, and they are expected to be active 24/7. These components generate a great deal of heat, and hence the storage system must be cooled to maintain reliability. The primary motivation for our work comes from the observation that the cooling mechanism and the workload assignments in a storage center are intricately tied together. The key outcomes of the project are as follows.

First, we developed techniques for thermal scheduling, where the goal is to schedule the workload in a thermally aware manner, assigning jobs to machines not just based on the local load of each machine, but based on the overall thermal profile of the data center. The heat generated by jobs running on a machine raises its own temperature as well as the temperatures of nearby machines due to hot-air recirculation effects. In addition, the data center geometry plays a significant role in determining these cross-effect parameters, which are often asymmetric (an illustrative linear model of such cross-effects appears in the first sketch below). We developed several thermally aware scheduling algorithms, under several different cross-effect models, that either minimize the maximum temperature in the data center or maximize the total profit of the assigned jobs while keeping the maximum temperature below a given limit. We also developed solutions that address the problem at the architecture level, scaling CPU frequencies so that silicon- and disk-level temperatures are explicitly kept within acceptable limits rather than maintaining a specific air temperature; air temperature by itself may be an inaccurate predictor of the device-level temperatures that primarily determine reliability. We developed various heuristics that exploited the mathematical structure of the problem.

Second, we developed a suite of techniques for minimizing the total resource consumption, and thereby the total energy consumption, when analyzing or querying very large volumes of data in a distributed fashion. For read-only analytical tasks, we developed a data replication and placement strategy that minimizes the number of machines that need to be involved in the execution of a task. Similarly, for transactional workloads, we designed a scalable, workload-aware approach for minimizing the number of distributed transactions. We also designed and implemented a runtime platform for executing big data analysis tasks on a powerful multi-core server. Moreover, we studied the problem of reducing the 'on' time of servers. It has been observed that a majority of servers run at or below 20% utilization most of the time, yet they draw nearly the same amount of power irrespective of their utilization. Hence, effective batching of jobs, respecting their requirements and machine capacities, can reduce the number of servers that must be running at any given time (a textbook packing heuristic in this spirit appears in the second sketch below); we significantly improved the existing results and gave new, faster algorithms with provable performance bounds.
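To make the cross-effect idea concrete, the following is a minimal sketch of a thermally aware placement heuristic. It assumes a linear recirculation model in which the temperature of machine i is the supplied air temperature plus sum_j D[i][j] * p_j for an asymmetric cross-effect matrix D and per-machine power draws p_j. The matrix values, function names, and greedy placement rule are illustrative assumptions for exposition, not the algorithms developed in the project.

    import numpy as np

    def predicted_temps(t_sup, D, power):
        # Linear recirculation model: machine temperatures equal the supplied
        # air temperature plus the (asymmetric) cross-effect matrix applied to
        # the per-machine power draws.
        return t_sup + D @ power

    def greedy_thermal_assign(job_powers, D, t_sup):
        # Illustrative heuristic: place each job (largest first) on the machine
        # that keeps the predicted maximum temperature lowest.
        n = D.shape[0]
        power = np.zeros(n)
        assignment = []
        for p_job in sorted(job_powers, reverse=True):
            peaks = []
            for m in range(n):
                trial = power.copy()
                trial[m] += p_job
                peaks.append(predicted_temps(t_sup, D, trial).max())
            best = int(np.argmin(peaks))
            power[best] += p_job
            assignment.append(best)
        return assignment, predicted_temps(t_sup, D, power)

    # Hypothetical 3-machine example with asymmetric cross-effects.
    D = np.array([[0.050, 0.020, 0.005],
                  [0.030, 0.050, 0.020],
                  [0.005, 0.030, 0.050]])
    assignment, temps = greedy_thermal_assign([120.0, 80.0, 60.0], D, t_sup=18.0)
    print(assignment, temps)

Both objectives discussed above, minimizing the maximum temperature or maximizing profit subject to a temperature cap, can be stated over exactly this kind of model.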
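The batching observation can likewise be illustrated with a standard first-fit-decreasing packing: given per-job resource demands and a uniform server capacity, packing the active jobs onto as few servers as possible lets the remaining servers be powered down. This is only a textbook heuristic, shown for intuition; the project's algorithms and their bounds are described in the linked publications.

    def first_fit_decreasing(demands, capacity):
        # Pack job demands onto as few servers as possible so that the
        # remaining servers can be powered down. 'demands' maps job ids to
        # resource requirements; every demand must fit within 'capacity'.
        free = []          # remaining capacity of each open server
        placement = {}
        for job, d in sorted(demands.items(), key=lambda kv: -kv[1]):
            for s, room in enumerate(free):
                if d <= room:
                    free[s] -= d
                    placement[job] = s
                    break
            else:
                free.append(capacity - d)
                placement[job] = len(free) - 1
        return placement, len(free)

    # Hypothetical workload: utilizations as fractions of one server's capacity.
    placement, servers_needed = first_fit_decreasing(
        {"a": 0.20, "b": 0.15, "c": 0.60, "d": 0.50, "e": 0.30}, capacity=1.0)
    print(servers_needed, placement)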
Third, we focused on reducing communication costs in data centers, as well as in distributed computing environments that use these data centers (e.g., YouTube running on cellphones), which also has implications for energy minimization. A widely used billing rule for network bandwidth is the peak bandwidth rule: the billing cycle is divided into slots, and the charge is based on the peak bandwidth consumed in any slot of the cycle. Since the data traffic is not known a priori, we considered the online problem of minimizing the maximum bandwidth. Interestingly, peak bandwidth minimization has connections to energy minimization. The power consumed by a processor increases with the speed at which it runs, and the speed required is dictated by the release times, deadlines, and processing requirements of the jobs so as to ensure feasibility. Hence, the optimal offline solution that minimizes the peak bandwidth consumed in any time slot coincides with the optimal offline schedule that minimizes the maximum speed of (and hence the maximum power consumed by) a processor while remaining feasible; a small illustrative computation of this quantity is sketched at the end of this report.

We also investigated the problem of laying out video data on disk tracks so that, in the presence of bandwidth constraints, video quality degrades gracefully, thereby helping reduce the energy consumed by the disks. Furthermore, we considered the problem of distributed sharing of video data across a network of cellphones. By combining distributed video coding, MPEG-based compression, and particle filtering, we developed approaches for sharing video information in an energy- and thermal-aware fashion.

More details about the project, and the publications that resulted from it, can be found at: www.cs.umd.edu/~samir/thermal/thermal.html.
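As a concrete illustration of the bandwidth/speed connection noted above, the sketch below computes the classical critical-interval quantity: for jobs with release times, deadlines, and work requirements, the smallest feasible peak speed on a single preemptive processor is the maximum, over time intervals, of the total work that must be done entirely inside the interval divided by the interval's length (the same density argument used in classical speed-scaling analyses). The job data and function name are hypothetical, and this is only an offline computation shown for intuition, not the project's online algorithm.

    def min_peak_speed(jobs):
        # jobs: list of (release, deadline, work). Returns the smallest peak
        # speed (equivalently, peak per-slot bandwidth) at which every job can
        # be finished within its window on one preemptive processor: the
        # densest interval [t1, t2] determines it.
        times = sorted({t for r, d, _ in jobs for t in (r, d)})
        best = 0.0
        for i, t1 in enumerate(times):
            for t2 in times[i + 1:]:
                work = sum(w for r, d, w in jobs if t1 <= r and d <= t2)
                best = max(best, work / (t2 - t1))
        return best

    # Hypothetical instance: three transfers given as (release, deadline, volume).
    print(min_peak_speed([(0, 4, 6.0), (1, 3, 4.0), (2, 6, 2.0)]))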