Users rely on computers to save valuable data, with the expectation that the data will be available and accessible at any time. Consequently, storage and server systems have to provide stringent reliability guarantees. Modern data centers employ many techniques, such as redundant data copies, multiple servers, and backup hardware, to prevent data loss due to failures. However, these techniques consume more energy, either to sustain additional hardware or to perform additional software tasks that keep disks busy longer. This creates a trade-off between energy management and data reliability improvement, both of which are critical technologies that will direct the future of computer systems development. Thus, this project investigates the combined impact of energy efficiency and data reliability on storage systems. A novel metric that captures energy-reliability interactions allows for designing optimization techniques that provide integrated reliability and energy management for modern storage systems. The results from this project are expected to lead to a better understanding of the interactions between energy management and reliability improvement techniques in storage systems, and to novel storage system organizations and designs that balance reliability and energy efficiency. The developed mechanisms will also enable further research in energy-efficient and reliable systems at scale. Moreover, the project employs an integrated research and education approach for training both undergraduate and graduate researchers, especially from underrepresented groups. The training will instill critical system development skills and provide valuable learning opportunities in designing energy-efficient and reliable computer systems.
Energy management and reliability in computing systems are both critical in designing future IT infrastructure. Extensive research has been performed on energy management and on reliability individually; however, a research effort that takes both areas into consideration has been lacking. The focus of this project is on developing techniques for identifying and quantifying the interactions between energy efficiency and reliability. This also enables us to understand the behavior of current energy-efficient systems when reliability is incorporated into their designs. We have developed a new metric, the energy-reliability product (ERP), that provides a unified mechanism for evaluating both energy efficiency and data reliability in a system. The ERP metric enables easy incorporation of multi-constraint optimizations in storage system design, and provides a means to design a tiered energy-efficient storage solution for emerging Hadoop clusters. ERP captures the energy and reliability of individual disks as well as of a distributed storage system. The work was completed and resulted in an effective solution for optimizing new storage systems under reliability and energy efficiency constraints. We have shown that ERP can help identify an efficient distribution of disk idle time between energy management and reliability management. In our study, we relied on simpler techniques for both energy and reliability management, but the evaluation techniques using ERP can guide the design of more advanced energy-reliability management as well. A storage system that factors in these optimizations, SARD, was also developed. SARD showed that it is possible to incorporate energy efficiency into storage systems without compromising performance. We also designed a de-duplication-aware extension to the NFS protocol for HPC systems to remove redundant data, better manage network I/O bandwidth, and improve energy utilization.
Our results show that sharing information between clients and storage servers significantly reduces network traffic and yields higher energy efficiency. Applying the metrics developed in this project to Hadoop setups shows that classifying datasets as hot or cold and managing their replication separately provides an energy-efficient solution that supports high availability while reducing energy consumption compared to current deployments. To this end, we also developed the HadoopSim simulator, which benefits other researchers by providing a simulation platform for studying Hadoop cluster performance and reliability based on the underlying network topology. The simulator is easily extensible to allow incorporation of other modules and protocols, and can therefore jumpstart research in new projects involving Hadoop storage and compute clusters. The research fosters collaboration and helps scientists in other disciplines tackle problems that have so far involved prohibitively large data centers. The tools and technologies developed in this project can support systems from a wide-ranging vendor and user base, in fields as diverse as computational biology, fusion, combustion, astrophysics, neutron scattering and climate modeling, all of which involve large data. Our research so far allows for better utilization and programming of heterogeneous resources in HPC clusters for scientific and enterprise computing.
Thus, it fosters more collaboration and helps scientists tackle problems that have so far involved prohibitively large data sizes. An HPC center can support applications from a wide-ranging HPC user base, in fields as diverse as computational biology, fusion, combustion, astrophysics, neutron scattering and climate modeling. This research project trained students in building large, practical and efficient computer systems that support energy conservation while still giving preference to reliability. Students were trained in developing applications and systems with stringent energy schemes and strong reliability guarantees, which will enhance their future technical careers. In addition, the PIs have introduced two new courses, Green Computing and File and Storage Systems. The classes offer students the opportunity to learn about energy, especially in the storage and I/O subsystem, and about related factors such as the impact of energy management on the performance, reliability, and usability of devices.