Many existing parallel storage systems consist of hybrid storage components, including solid-state drives (SSDs), hard disk drives (HDDs), and tapes. Compared with high-speed storage components (e.g., SSDs and HDDs), tapes inevitably become an I/O performance bottleneck. Prefetching and caching are commonly employed to boost I/O performance by increasing the hit rate of high-end storage components. However, prefetching in hybrid storage systems is technically challenging because of a dilemma: aggressive prefetching schemes can efficiently reduce I/O latency, whereas overaggressive schemes may waste I/O bandwidth by transferring useless data from HDDs to SSDs or from tapes to HDDs. In this research project, called FastStor, we investigate new data-mining-based multilayer prefetching techniques to improve the performance of hybrid storage systems. The goals of this research are to (1) design data-mining algorithms for multilayer prefetching; (2) develop a predictive parallel prefetching mechanism for SSD-based storage systems; (3) implement parallel data transfer among SSDs, HDDs, and tapes; (4) develop metadata management schemes; and (5) implement a simulation framework named FastStor-SIM. The developed toolkit can be used to improve the I/O performance of data centers with hybrid storage systems. The research findings of this project are published in conferences or journals for public knowledge. Through the collaboration of Auburn University, South Dakota School of Mines and Technology, and the University of Southern Mississippi, the PIs promote learning and training by exposing graduate and undergraduate students to the technological underpinnings of storage systems.

Project Report

To balance performance and cost, many large-scale storage systems consist of hybrid storage components, including solid-state drives (SSDs), hard disk drives (HDDs), and tapes. Compared with high-speed storage components (e.g., SSDs and HDDs), tapes are likely to become an I/O performance bottleneck. Prefetching and caching are efficient techniques for boosting I/O performance by increasing the hit rate of high-end storage components. However, prefetching in hybrid storage systems is technically challenging: it can reduce I/O latency, but it can also waste I/O bandwidth and energy by pre-processing and transferring useless data. The primary goal of this project is to investigate innovative data-mining-based multilayer prefetching techniques that improve performance without significantly increasing the energy consumption and cost of hybrid storage systems. Three universities are involved in this collaborative project: Auburn University (the lead institution), Texas State University, and Texas A&M University at Kingsville. Texas State University also works closely with the Earth Resources Observation and Science Center (EROS) of the U.S. Geological Survey (USGS).
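The tradeoff described above, lower latency versus wasted bandwidth, can be illustrated with a small simulation. The following is a minimal sketch only, not the project's data-mining-based algorithm: it assumes a single fast tier managed with LRU replacement and a naive next-block prefetcher, and the names `simulate`, `cache_size`, and `depth` are hypothetical.

```python
from collections import OrderedDict


def simulate(trace, cache_size, depth):
    """Replay `trace` (a list of integer block ids) against an LRU fast tier.

    After every access, the next `depth` block ids are prefetched from the
    slow tier (an assumed, deliberately naive policy). Returns a tuple
    (hit_rate, prefetched, wasted), where `wasted` counts prefetched blocks
    that were evicted, or left in the cache at the end, without being read.
    """
    cache = OrderedDict()  # block id -> True if prefetched and not yet read
    hits = prefetched = wasted = 0

    def insert(block, is_prefetch):
        nonlocal wasted
        if block in cache:
            return False  # already resident, nothing is transferred
        if len(cache) >= cache_size:
            _, unused = cache.popitem(last=False)  # evict least recently used
            wasted += unused
        cache[block] = is_prefetch
        return True

    for block in trace:
        if block in cache:
            hits += 1
            cache[block] = False  # the block has now been read
            cache.move_to_end(block)
        else:
            insert(block, False)  # demand fetch from the slow tier
        for nxt in range(block + 1, block + 1 + depth):
            if insert(nxt, True):
                prefetched += 1

    wasted += sum(cache.values())  # prefetched blocks that were never read
    return hits / len(trace), prefetched, wasted


if __name__ == "__main__":
    sequential = list(range(100))              # prefetch-friendly workload
    scattered = [i * 1000 for i in range(50)]  # no spatial locality
    for name, trace in (("sequential", sequential), ("scattered", scattered)):
        for depth in (0, 1, 4):
            print(name, depth, simulate(trace, cache_size=8, depth=depth))
```

On the sequential trace nearly every prefetched block is later read, so the hit rate rises with little waste. On the scattered trace every prefetched block is transferred and then discarded unread, which is the wasted bandwidth the report refers to.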
The research and education outcomes derived from the Texas State University grant are summarized below:
1) Research Activities: A number of research projects were conducted, including geo-visualization of massive satellite download requests provided by the USGS EROS; evaluation of how well conventional caching algorithms and existing data-mining-based prefetching algorithms improve the performance of EROS hybrid storage systems; analysis of EROS user download patterns and behavior; design of popularity-oriented and user-specific prefetching algorithms and evaluation of their impact on both performance and energy efficiency; development of the first SQL engine that can run SQL queries on the Intel Xeon Phi to accelerate data processing; and characterization of the energy consumption of programs running on GPUs and the Intel Xeon Phi. These projects have generated a number of novel algorithms and new studies, which contribute to the disciplines of hybrid storage systems, data visualization, data mining, high performance computing (HPC), green computing, and big data analytics.
2) Publications: At the time of this report's submission, eleven peer-reviewed papers have been published in highly recognized journals and IEEE/ACM-sponsored conferences and workshops, including the Journal of Cluster Computing, Journal of Applied Remote Sensing, Journal of Environmental Modeling & Software, ACM/IEEE Supercomputing Conference (SC), International Conference on Parallel Processing (ICPP), International Conference on Big Data Science and Computing, IEEE International Conference on Networking, Architecture, and Storage, and IEEE International Performance Computing and Communications Conference (IPCCC). In addition, a book chapter has been accepted for the book Big Data Algorithms, Analytics, and Applications and is currently in press.
3) Training: This NSF project provided ample opportunities for both graduate and undergraduate students at Texas State University to conduct research in high performance computing, data mining, big data analytics, and hybrid storage systems. Four graduate students and four undergraduate students participated in the aforementioned research projects led by the PI. Many of them authored journal, conference, or workshop papers: six papers were co-authored by graduate students and four by undergraduate students.
4) Education: The research approaches and results have been introduced into both undergraduate and graduate courses to benefit a large group of students at three institutions. Students in these classes can use the corresponding topics to gain first-hand experience with, and intuition for, the fundamental concepts and research frontiers of data mining, big data analytics, and hybrid storage systems.
5) Broad Impact: The PIs strive to involve minority students in the research projects supported by this grant. Ivan Zecena, a Hispanic student at Texas State University, is one exemplary minority student. Ivan first worked with PI Zong as an undergraduate research assistant in Fall 2011. Motivated to conduct research, he applied to the master's program and was accepted right after receiving his bachelor's degree. During his graduate study at Texas State University, Ivan published two papers and won several awards, including a travel award from the IEEE International Symposium on Workload Characterization (2012) and the Texas State University Best Research Poster Award for Minority Students (2013). He was also accepted by the competitive Broader Engagement program of the Supercomputing conference (SC12) and the NSF-supported Extreme Science and Engineering Discovery Environment (XSEDE) program. Ivan graduated in August 2013 with his master's degree from Texas State University and is currently working in the HPC group of General Motors (GM). Sarah Abdulsalam, a female graduate from Egypt, was accepted by the Broader Engagement program of the Supercomputing conference (SC14). She has successfully submitted her first research paper.
6) Released Data: The research findings of this project are published in journals, conferences, or workshops for public knowledge. The Vis-EROS project, which demonstrates how visualization can help data analytics and prefetching algorithm design, is also available online at: http://cs.txstate.edu/~zz11/viseros/

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 1212535
Program Officer: M. Mimi McClure
Project Start:
Project End:
Budget Start: 2011-09-01
Budget End: 2014-11-30
Support Year:
Fiscal Year: 2012
Total Cost: $169,816
Indirect Cost:
Name: Texas State University - San Marcos
Department:
Type:
DUNS #:
City: San Marcos
State: TX
Country: United States
Zip Code: 78666