Large-scale parallel systems are critical to our computational infrastructure: they take on challenges posed by applications whose scale and demands exceed the capabilities of machines available on the market today. Pushing the limits of hardware and software technologies to extract maximum performance, in turn, exacerbates other problems. Notable among these is susceptibility to failures, which arises from growing rates of transient hardware errors, hardware device failures, software complexity, and the intricate hardware/software interdependencies between the nodes of a parallel system. These failures can substantially degrade system performance and increase maintenance and operating costs, thereby putting at risk the very motivation behind deploying these large-scale systems.
This research is expected to make three broad contributions toward developing a runtime infrastructure, called PROGNOSIS, for failure data collection and online analysis. The first set of contributions will come from collecting and analyzing system events and failure data from an actual BlueGene/L system over an extended period of time. In addition to presenting the raw system events, the research will develop filtering techniques to remove unimportant information and to identify stationary intervals, together with defining which attributes to log and at what frequency. The second set of contributions will be models for the online analysis and prediction of evolving failure data, exploiting correlations between system events over time, across nodes, and with respect to external factors such as imposed workload and operating temperature. The third set of contributions will demonstrate the uses of PROGNOSIS. Tools such as PROGNOSIS can contribute substantially to the development of self-healing systems, which several computer vendors have identified as an important goal in the emerging area of Autonomic Computing.
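To make the event-filtering idea concrete, the sketch below shows one plausible temporal-filtering step: collapsing a burst of duplicate reports of the same event from the same node into a single record. It is a minimal illustration only; the Event fields, the five-minute window, and the duplicate criterion are assumptions made for exposition, not the actual PROGNOSIS design or the real BlueGene/L log schema.

```python
from dataclasses import dataclass

# Hypothetical event record; the field names are illustrative assumptions,
# not the real BlueGene/L log schema.
@dataclass
class Event:
    timestamp: float   # seconds since epoch
    node_id: str       # identifier of the reporting node
    severity: str      # e.g. "FATAL", "WARNING", "INFO"
    message: str       # event description text

def temporal_filter(events, window=300.0):
    """Drop repeated reports of the same event from the same node.

    Events with identical (node_id, severity, message) arriving within
    `window` seconds of the previously seen instance are treated as
    duplicates of one underlying fault and discarded; the window slides
    forward on every repeat, so a long burst collapses to one record.
    """
    last_seen = {}  # (node_id, severity, message) -> timestamp last seen
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node_id, ev.severity, ev.message)
        prev = last_seen.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        last_seen[key] = ev.timestamp
    return kept
```

A spatial-filtering pass, merging near-simultaneous reports of the same fault from different nodes, could follow the same pattern with the node identifier dropped from the key.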