Dependability is a requirement for computer systems; however, research on dependable systems is hampered by a lack of real, publicly available failure data. This can close productive paths of research to most researchers and, conversely, lead to unproductive research based on faulty assumptions about how real systems fail. The goal of this project is to plan a collaborative effort to collect, curate, and provide public access to failure data for large-scale computer systems through a community repository. One challenge is that the owners of failure data consider it sensitive. The ultimate goal is to collect data from some of the NSF-funded large cyberinfrastructure projects, such as NEES, LIGO, XSEDE, and NRAO.
The specific goal of this planning project is to collect requirements from potential participants (both users and contributors of data sets) and to seed a prototype repository with data sets from two of the largest and latest clusters at Purdue. The data sets will comprise static information, dynamic information about the workloads, and failure information for both planned and unplanned outages.
The broader impact of the project will be achieved through the dissemination of the data sets to a wide variety of researchers and perhaps even practitioners. The data sets will let people run their campus clusters more efficiently, that is, with fewer failures and at higher utilization and energy efficiency.