Systems today must cope with failures induced by many factors outside the control of the organization producing the software: faults in infrastructure and components developed by third parties, unpredictable loads, and variable resources. Modern systems must therefore take increasing responsibility for problem detection and repair at runtime. Fault detection and repair could be greatly enhanced by run-time fault diagnosis and localization -- the ability to identify the source of a problem so that appropriate actions can be taken, either by a human operator or by automated mechanisms, to repair the system.
In this research we are developing new foundations for run-time fault diagnosis and localization. To do this, we are extending and synthesizing recent advances in two areas. The first is the use of architecture models for monitoring a system at runtime. The second is the use of spectrum-based reasoning for fault localization (SFL). SFL is a lightweight technique that takes as its input a form of trace abstraction and produces a list of likely fault candidates, ordered by their probability of being the true fault explanation. It has been used with impressive results at design time, but thus far has not been exploited at runtime in the context of architecture-based monitoring and diagnosis.
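To make the input and output of SFL concrete, the following is a minimal sketch rather than the approach developed in this research. It assumes the trace abstraction is a binary matrix recording which components were involved in each observed transaction, together with a pass/fail verdict per transaction. It ranks components with the Ochiai similarity coefficient, a simple and widely used SFL score, in place of the probabilistic reasoning described above, so the scores are similarities rather than probabilities. The function name, the data layout, and the sample data are all hypothetical.

```python
from math import sqrt


def ochiai_ranking(spectra, outcomes):
    """Rank components by Ochiai similarity between activity and failure.

    spectra:  list of rows, one per observed transaction; row[j] is 1 if
              component j was involved in that transaction, else 0.
    outcomes: list of booleans, True if the transaction failed.
    Returns a list of (component_index, score), most suspicious first.
    """
    n_components = len(spectra[0])
    total_failed = sum(outcomes)
    scores = []
    for j in range(n_components):
        # n11: failed transactions involving j; n10: passed ones involving j.
        n11 = sum(1 for row, failed in zip(spectra, outcomes) if row[j] and failed)
        n10 = sum(1 for row, failed in zip(spectra, outcomes) if row[j] and not failed)
        denom = sqrt(total_failed * (n11 + n10))
        scores.append((j, n11 / denom if denom else 0.0))
    return sorted(scores, key=lambda s: s[1], reverse=True)


# Hypothetical observations: 4 transactions over 3 components.
spectra = [
    [1, 1, 0],  # failed
    [0, 1, 1],  # failed
    [1, 0, 0],  # passed
    [1, 0, 1],  # passed
]
outcomes = [True, True, False, False]

ranking = ochiai_ranking(spectra, outcomes)
for component, score in ranking:
    print(f"component {component}: {score:.3f}")
```

In this made-up data, component 1 is involved in both failing transactions and in no passing one, so it is ranked first. In a run-time setting, the rows would be derived from monitored transactions rather than from test executions.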
This research will improve the trustworthiness and robustness of modern software systems by providing new techniques for diagnosing faults while a system is running, thereby providing an improved basis for fault detection and resolution.