Large-scale parallel systems are critical to our computational infrastructure for meeting the challenges posed by applications whose scale and demands exceed the capabilities of machines available on the market today. Pushing hardware and software technologies to their limits to extract maximum performance, in turn, exacerbates other problems. Notable among these is susceptibility to failures, which arises from a growing number of transient hardware errors, hardware device failures, software complexity, and the complex hardware/software interdependencies between the nodes of a parallel system. These failures can substantially degrade system performance and raise maintenance and operating costs, thereby putting at risk the very motivation for deploying these large-scale systems.

This research is expected to make three broad contributions toward developing a runtime infrastructure, called PROGNOSIS, for failure data collection and online analysis. The first set of contributions will be the collection and analysis of system events and failure data from an actual BlueGene/L system over an extended period of time. In addition to presenting the raw system events, the research will develop filtering techniques to remove unimportant information, identify stationary intervals, and define the attributes to log and their logging frequency. The second set of contributions will be models for online analysis and prediction of evolving failure data that exploit correlations between system events over time, across nodes, and with respect to external factors such as imposed workload and operating temperature. The third set of contributions will demonstrate the uses of PROGNOSIS. Tools such as PROGNOSIS can help substantially in the development of self-healing systems, which several computer vendors have identified as an important goal in the emerging area of Autonomic Computing.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 0509164
Program Officer: Frederica Darema
Budget Start: 2005-08-01
Budget End: 2006-07-31
Fiscal Year: 2005
Total Cost: $79,999

Name: Rutgers University
City: New Brunswick
State: NJ
Country: United States
Zip Code: 08901