As software systems have grown in size, complexity, and cost, it has become increasingly difficult to deliver bug-free software to end users, which results in many software failures during production runs at the user site. While much work has been conducted on software failure diagnosis, most previous work focuses on off-site diagnosis (i.e., diagnosis at the development site with the involvement of programmers) and is therefore insufficient for diagnosing production-run software failures at the user site.
To effectively address production-run failures, we propose a novel approach that automatically performs on-site software failure diagnosis at the moment a failure occurs and provides programmers with a detailed diagnosis report on the failure, including bug type, bug location, likely root cause, fault propagation chain, failure-triggering input, failure-triggering execution environment, potential temporary fixes, etc., without violating users' privacy or imposing large overhead during normal execution. To achieve this ambitious goal, the proposed research tightly integrates innovations from multiple layers: (1) Low-overhead operating system and run-time system support to capture the failure moment without imposing large overhead during normal execution. (2) A novel, extensible, customizable, human-like failure diagnosis protocol. (3) Novel program analysis techniques specifically designed for on-site failure diagnosis. (4) Use of existing and emerging hardware support, together with simple hardware extensions, to reduce overhead. (5) A library-based API that allows applications to control or customize the diagnosis process if necessary.