SHF: Small: Failure Events Modeling and Analysis for Proactive Management in Highly Dependable Systems

Xu, Chengzhong

Abstract

In large-scale computer systems, component failures are no longer rare events. As these systems continue to grow, their reliability and service availability become increasingly critical concerns. Recent IT expenditure analyses also show that worldwide spending on server management and administration has surpassed the cost of new server acquisition. Conventional reactive troubleshooting measures and conservative checkpointing approaches are often counterproductive or may cause lengthy service disruptions. The goal of this FEMA project is to develop modeling and analytical methodologies and tools that characterize system failure dynamics for proactive failure management in highly dependable systems.

The FEMA project has three parts. The first is the development of an aggregated spherical covariance model that characterizes failure dynamics quantitatively. The model centers on the concept of a failure signature, which correlates a group of OS-level performance parameters and operation-level job allocation information with different types of fault events in both the space and time domains. The second is an innovative application of statistical learning methods to failure prediction. Different failure types in different system scopes have different failure dynamics and different amounts of historical data for training, and different prediction metrics pose different requirements for prediction granularity. Consequently, various supervised, unsupervised, and reinforcement learning algorithms apply in different scenarios. The third is the development of system reliability traces for offline evaluation and a methodology for online prediction in production systems. The traces contain not only a log of failure events but also the corresponding operational contexts that are necessary for attaining high prediction accuracy.
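
For readers unfamiliar with the term, the sketch below shows the standard spherical covariance function from geostatistics, which decays from a maximum at zero separation to exactly zero at a finite range. It is only an illustration of the general idea, not the project's aggregated model: the function names, the parameters (`sill`, `corr_range`), and the choice to multiply a spatial term by a temporal term are all assumptions made for this example.

```python
def spherical_covariance(distance, sill=1.0, corr_range=1.0):
    """Standard spherical covariance: sill at distance 0, zero at or beyond corr_range."""
    if distance < 0 or corr_range <= 0:
        raise ValueError("distance must be >= 0 and corr_range must be > 0")
    if distance >= corr_range:
        return 0.0
    ratio = distance / corr_range
    return sill * (1.0 - 1.5 * ratio + 0.5 * ratio ** 3)


def failure_correlation(node_distance, time_gap, space_range, time_range):
    """Hypothetical separable space-time term (an assumption, not the project's model).

    Multiplies a spatial and a temporal spherical covariance so that two failure
    events are treated as correlated only if they are close in both domains.
    """
    spatial = spherical_covariance(node_distance, corr_range=space_range)
    temporal = spherical_covariance(time_gap, corr_range=time_range)
    return spatial * temporal


if __name__ == "__main__":
    # Two failures 2 hops apart and 30 minutes apart, with assumed ranges of
    # 8 hops and 120 minutes.
    print(failure_correlation(2, 30, space_range=8, time_range=120))
```

Under these assumptions, events separated by more than either range contribute zero correlation, which is one simple way to bound how far a failure signature reaches in space and time.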

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1016966
Program Officer: Almadena Chtchelkanova

Project Start
Project End
Budget Start: 2010-09-01
Budget End: 2015-08-31
Support Year
Fiscal Year: 2010
Total Cost: $467,768
Indirect Cost

Wayne State University, Detroit, MI, United States