A large number of existing parallel storage systems consist of hybrid storage components, including solid-state drives (SSDs), hard disks (HDDs), and tapes. Compared with high-speed storage components (e.g., SSDs and HDDs), tapes inevitably become an I/O performance bottleneck. Prefetching and caching are commonly employed to boost I/O performance by increasing the hit rate of high-end storage components. However, prefetching in hybrid storage systems is technically challenging because of a dilemma: aggressive prefetching schemes can effectively reduce I/O latency, whereas overaggressive schemes may waste I/O bandwidth by transferring useless data from HDDs to SSDs or from tapes to HDDs. In this research project, called FastStor, we investigate new data-mining-based multilayer prefetching techniques to improve the performance of hybrid storage systems. The goals of this research are to (1) design data-mining algorithms for multilayer prefetching; (2) develop a predictive parallel prefetching mechanism for SSD-based storage systems; (3) implement parallel data transfer among SSDs, HDDs, and tapes; (4) develop metadata management schemes; and (5) implement a simulation framework named FastStor-SIM. The developed toolkit can be used to improve the I/O performance of data centers with hybrid storage systems. The research findings of this project are published in conferences and journals. Through the collaboration of Auburn University, South Dakota School of Mines and Technology, and the University of Southern Mississippi, the PIs promote learning and training by exposing graduate and undergraduate students to the technological underpinnings of the field of storage systems.

Project Report

This project was motivated by a global online satellite image distribution system operated at the Earth Resources Observation and Science (EROS) center of the U.S. Geological Survey. Fundamental objectives of EROS include, but are not limited to, building high-speed and cost-effective massive data processing and storage systems to support online satellite image distribution. Hybrid storage systems, in which solid-state drives (SSDs), hard disks (HDDs), and tapes are seamlessly integrated, provide an ideal data storage solution for a wide variety of data processing centers like EROS. Over the past five years, we have focused on prefetching techniques for large-scale hybrid storage systems, which are becoming increasingly popular. Our findings confirm that frequently accessed data in a hybrid storage system can be prefetched and cached on high-speed storage components such as solid-state drives. An SSD-based hybrid storage system can provide large storage capacity, high I/O performance, and data reliability.

Intellectual Outcomes. We have built data-mining-based multilayer prefetching mechanisms to significantly improve the performance of hybrid storage systems with solid-state drives, hard disks, and tapes. Specifically, this project has achieved the following five intellectual outcomes. First, we designed and implemented a predictive prefetching approach based on a weighted graph in the Lustre file system. Second, we implemented a predictive scheduling and prefetching mechanism, which was integrated into the native MapReduce runtime system. Third, we designed and implemented pipelined prefetching mechanisms, which use application-disclosed access patterns to prefetch hinted blocks in multi-level storage systems. Fourth, we designed, implemented, and evaluated a new software tool, pScope, that can perform massively parallelized mass spectrometry. Last, we proposed a key-aware data placement strategy for the Hadoop distributed file system on clusters.

Educational Outcomes.
This project has yielded several years' worth of educational outcomes. We have established a modern Storage System Laboratory at Auburn University. We have also developed a cross-listed undergraduate/graduate course on the subject of storage systems. We have leveraged a partnership with the Alabama Power Academic Excellence Program at Auburn University to increase the involvement of underrepresented students in research activities. Students from underrepresented groups who have participated in project activities include two African-American students, one Hispanic student, and four female students.

Broader Impacts. The project has the following major impacts. (1) The project benefits society by developing a hybrid storage system, which accelerates research progress in data-mining-based prefetching for large-scale storage systems. (2) The availability of our novel predictive prefetching techniques significantly improves the I/O performance of next-generation hybrid storage systems with long archive life and low cost. The success of this project leads to another wave of increased usage of SSD-based hybrid storage systems and benefits the economy of our country. (3) This project promotes teaching, learning, and training by exposing graduate and undergraduate students to the technological underpinnings of storage systems in general and prefetching in particular. (4) We have established and maintain strong partnerships with the U.S. Geological Survey (USGS) Center for Earth Resources Observation and Science (EROS), the IBM T. J. Watson Research Center, and Intel to facilitate the broad dissemination of novel data-mining-based predictive prefetching technologies to the scientific and industrial sectors.
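
As a rough illustration of the weighted-graph predictive prefetching named in the first intellectual outcome, the sketch below builds a successor graph from an access trace and prefetches the most likely next block into a small fast tier. It is a toy model under stated assumptions, not the project's Lustre implementation: the class `WeightedGraphPrefetcher`, the function `replay`, the parameters `min_confidence` and `max_prefetch`, and the FIFO fast tier are all hypothetical names and simplifications chosen for this example.

```python
from collections import OrderedDict, defaultdict


class WeightedGraphPrefetcher:
    """Toy successor-graph predictor (hypothetical; not the FastStor code)."""

    def __init__(self, min_confidence=0.5, max_prefetch=1):
        # edges[a][b] counts how often block b was accessed right after block a.
        self.edges = defaultdict(lambda: defaultdict(int))
        self.min_confidence = min_confidence  # assumed tuning knob
        self.max_prefetch = max_prefetch      # assumed tuning knob
        self.last = None

    def record(self, block):
        """Add one observed access to the graph."""
        if self.last is not None:
            self.edges[self.last][block] += 1
        self.last = block

    def predict(self, block):
        """Return likely successors of `block`, heaviest edge first."""
        successors = self.edges.get(block)
        if not successors:
            return []
        total = sum(successors.values())
        ranked = sorted(successors.items(), key=lambda kv: (-kv[1], kv[0]))
        likely = [b for b, w in ranked if w / total >= self.min_confidence]
        return likely[: self.max_prefetch]


def _insert(cache, block, capacity):
    """Place a block in the fast tier, evicting the oldest entry if full."""
    cache[block] = True
    if len(cache) > capacity:
        cache.popitem(last=False)


def replay(trace, prefetcher, capacity=2):
    """Replay a trace against a FIFO fast tier.

    Returns (hits, prefetched): accesses served from the fast tier and the
    number of blocks moved there speculatively.
    """
    cache = OrderedDict()
    hits = 0
    prefetched = 0
    for block in trace:
        if block in cache:
            hits += 1
        else:
            _insert(cache, block, capacity)  # demand fetch from the slow tier
        prefetcher.record(block)
        for nxt in prefetcher.predict(block):
            if nxt not in cache:
                _insert(cache, nxt, capacity)
                prefetched += 1
    return hits, prefetched


if __name__ == "__main__":
    trace = ["a", "b", "c"] * 3
    print(replay(trace, WeightedGraphPrefetcher(), capacity=2))
    # A threshold above 1.0 disables prefetching, giving a baseline.
    print(replay(trace, WeightedGraphPrefetcher(min_confidence=2.0), capacity=2))
```

In this sketch, lowering `min_confidence` or raising `max_prefetch` makes the prefetcher more aggressive: more accesses may be served from the fast tier, but more blocks are also transferred that may never be read, which is the trade-off between latency and wasted bandwidth described in the abstract.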

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 0917137
Program Officer: M. Mimi McClure
Budget Start: 2009-09-01
Budget End: 2014-08-31
Fiscal Year: 2009
Total Cost: $248,000

Name: Auburn University
City: Auburn
State: AL
Country: United States
Zip Code: 36849