Many applications require high reliability and availability. Unfortunately, as software has grown in size and complexity, many bugs escape testing and cause failures in production. When a production system fails, software engineers are frequently called upon urgently to diagnose and resolve the issue on a tight schedule. Because such failures directly impact customers' business, vendors make diagnosing and fixing them their highest priority. In many cases it is impossible to reproduce production-run failures in house for various reasons (privacy, execution environments, etc.), so the common practice is for customers to send back the logs generated by the failed system. Such logs are the sole data source (in addition to source code) available to software engineers for troubleshooting the failure. Based on what is in the logs, they manually infer what may have happened in order to narrow down the root cause.

Unfortunately, this diagnosis process is mostly manual and very often a trial-and-error guessing game; it is therefore time-consuming, error-prone, and expensive in terms of both labor cost and system downtime. In particular, because log messages are added in an ad hoc way, many of them do not provide precise, informative clues to help narrow down the root cause. Furthermore, rapidly growing size and complexity, as well as software aging, have greatly affected the diagnosability of modern software.

To enable developers to quickly troubleshoot production-run failures and shorten system downtime, we propose automatic log inference and informative logging to make real-world software more diagnosable. We will investigate new diagnosis tools that analyze logs and source code together to help software engineers narrow down the possible root causes, and we will also explore new ways to automatically enhance software logging so that log messages are more effective and efficient for diagnosis. As software has become widely used in daily life, software reliability is becoming an important issue. Our proposed solutions will allow software engineers to quickly identify root causes and the patches that fix them, which would significantly reduce system downtime. As such, the work benefits both software/system vendors and computer users, especially financial companies, where an hour of downtime can result in millions of dollars in lost business.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1017784
Program Officer: Marilyn McClure
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2016-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $477,979
Indirect Cost:

Name: University of California San Diego
Department:
Type:
DUNS #:
City: La Jolla
State: CA
Country: United States
Zip Code: 92093