Storage systems are exploding in size, complexity and popularity, and as they grow, they expose themselves to a variety of failures. As such, the software that supports these systems must be able to tolerate a growing number of failures. Erasure coding is a technique that allows storage nodes to be enriched with extra coding information so that if one or more nodes fail, the original data may be recovered from the survivors. While erasure codes have existed for decades, there is a considerable gap in understanding constructions, performance and properties of codes that tolerate multiple failures. This project's mission is explore, design, evaluate and disseminate erasure codes for multiple failures that significantly improve the performance and failure coverage of current codes. The research component is centered around unifying disparate techniques from classical coding theory, disk array technology, and asymptotic graph theory.

This work will impact all applications where multiple storage nodes are used in tandem, and the risk of one or more nodes failing is significant. New, efficient codes will be developed and evaluated with respect to the best currently known codes. Additionally, the implementations of these codes will be disseminated to the community of systems programmers that need them, and tutorial material will be developed so that the codes may have wide impact. This is a serious issue, as most current codes are presented in a theoretcially formal way that puts them beyond the reach of the programmers who need to implement them. Finally, erasure codes will be used as the basis for motivational programs on math and science at K-12 schools in East Tennessee.

Project Report

We as a society rely heavily on computer storage for personal videos, news feeds and scientific data sets. While our data is of utmost importance, it is a fact of life that hard disk drives fail, and those failures expose us to the loss of data. Today's storage systems are made up of multiple disks, and until recently, these have been designed to tolerate the failure of any single disk. Techniques to tolerate multiple disk failures have not been explored as thoroughly, and heretofore their performance has not been good. This research project, funded by the NSF, is to design, analyze and implement techniques called erasure codes to tolerate multiple disk failures with high performance. A major outcome of the research project has been the discovery of multiple techniques for tolerating double-disk failures in large storage systems. These techniques have filled important gaps in the theory and practice of erasure codes and have already received attention from both academia and industry. A second outcome of the project has been the development and support of an open-source programming library called Jerasure that allows storage system developers to utilize the most recent and best performing erasure coding techniques to protect their systems from multiple disk failures. This library and its constituent techniques are patent-free, and therefore may be leveraged by both academic and industrial projects, striving to keep our nation's data safe from failures. A final outcome of the project has been a detailed assessment of the performance of a variety of open-source erasure coding libraries, not only to provide information on how these libraries perform on current hardware, but also to give an idea of how theoretical performance maps to real implementations. In addition to the outcomes above, this award has funded ten Undergraduates from the University of Tennessee to perform research on erasure codes. This provides them not only with tools and experience to help them in their future careers, but gives them a first-hand glimpse into the excitement that Computer Science research can offer. The majority of these students have decided to pursue research with graduate study in Computer Science at the University of Tennessee and elsewhere.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Type
Standard Grant (Standard)
Application #
0615221
Program Officer
Krishna Kant
Project Start
Project End
Budget Start
2006-08-01
Budget End
2010-07-31
Support Year
Fiscal Year
2006
Total Cost
$420,267
Indirect Cost
Name
University of Tennessee Knoxville
Department
Type
DUNS #
City
Knoxville
State
TN
Country
United States
Zip Code
37996