In large-scale computer systems, component failures are no longer rare events. As these systems continue to grow, their reliability and service availability become increasingly critical concerns. Recent IT expenditure analyses also show that worldwide spending on server management and administration has surpassed the cost of new server acquisition. Conventional reactive troubleshooting measures and conservative checkpointing approaches are often counterproductive or may cause lengthy service disruptions. The goal of this FEMA project is to develop modeling and analytical methodologies and tools that characterize system failure dynamics for proactive failure management in highly dependable systems.
This FEMA project has three parts. The first is the development of an aggregated spherical covariance model that characterizes failure dynamics quantitatively. The model centers on a failure signature concept that correlates a group of OS-level performance parameters and operation-level job allocation information with different types of fault events in both the space and time domains. The second is an innovative application of statistical learning methods to failure prediction. Different failure types in different system scopes have different failure dynamics and different amounts of historical data available for training, and different prediction metrics impose different requirements on prediction granularity. Supervised, unsupervised, and reinforcement learning algorithms are therefore each applied in the scenarios they suit. The third is the development of system reliability traces for offline evaluation and a methodology for online prediction in production systems. The traces contain not only a log of failure events but also the corresponding operational contexts, which are necessary for attaining high prediction accuracy.