Dependability is a requirement for computer systems; however, research on dependable systems is hampered by a lack of real, publicly available failure data. This can close productive paths of research to most researchers and, conversely, lead to unproductive research based on faulty assumptions about how real systems fail. The goal of this project is to plan a collaborative effort to collect, curate, and provide public access to failure data for large-scale computer systems through a community repository. One challenge is that failure data is considered sensitive by its owners. The ultimate goal is to collect data from some of the NSF-funded large cyberinfrastructure projects, such as NEES, LIGO, XSEDE, and NRAO.

The specific goal of this planning project is to collect requirements from potential participants (both users and contributors of data sets) and to seed a prototype repository with data sets from two of the largest and latest clusters at Purdue. The data sets will comprise static information, dynamic information about the workloads, and failure information for both planned and unplanned outages.

The broader impact of the project will be achieved through the dissemination of the data sets to a wide variety of researchers and perhaps even practitioners. The data sets will let people run their campus clusters more efficiently, i.e., with fewer failures and at higher utilization and energy efficiency.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Type
Standard Grant (Standard)
Application #
1405906
Program Officer
Marilyn McClure
Project Start
Project End
Budget Start
2014-07-15
Budget End
2016-06-30
Support Year
Fiscal Year
2014
Total Cost
$65,891
Indirect Cost
Name
Purdue University
Department
Type
DUNS #
City
West Lafayette
State
IN
Country
United States
Zip Code
47907