Software reliability affects virtually everyone. Thorough software checking is crucial to improving software reliability, but the checking coverage of most existing techniques is severely hampered by where they are applied: a software product is typically checked only at the site where it is developed, so the number of different states checked is throttled by that site's resources (e.g., machines, testers/users, software/hardware configurations). To address this fundamental problem, we will investigate mechanisms that enable software vendors to continue checking for bugs after a product is deployed, thus checking a drastically more diverse set of states. Our research contributions will include the investigation, development, and deployment of: (1) a wide-area autonomic software checking infrastructure to support continuous checking of deployed software in a transparent, efficient, and scalable manner; (2) a simple yet general and powerful checking interface to facilitate the creation of new checking techniques and the combination of existing techniques into more powerful means of finding subtle bugs that conventional pre-deployment testing often misses; (3) lightweight isolation, checkpoint, migration, and deterministic replay mechanisms that enable replication of application processes as checking launch points, isolation of replicas from users, migration of replicas across hosts, and replay of identified bugs without the need for the original execution environment; and (4) distributed computing mechanisms that efficiently and scalably leverage geographically dispersed idle resources to determine where and when replicas should be executed to improve the speed and coverage of software checking, thereby converting available hardware cycles into improved software reliability.

Project Report

Software reliability affects virtually everyone. Thorough software checking is crucial to improving software reliability, but the checking coverage of most existing techniques is severely hampered by where they are applied: a software product is typically checked only at the site where it is developed, so the number of different states checked is throttled by that site's resources (e.g., machines, testers/users, software/hardware configurations). To address this fundamental problem, we investigated mechanisms that enable software vendors to continue checking for bugs after a product is deployed, thus checking a drastically more diverse set of states. We developed novel program analysis, testing, and operating system techniques to make it practical to check software continuously. Specifically, we investigated ways to implement continuous checking systems efficiently, including tracking application states to eliminate redundant tests. We developed lightweight isolation, checkpoint, migration, and deterministic replay mechanisms that can be used in our checking infrastructure. We developed operating system virtualization mechanisms that leverage the Linux kernel to provide these features across a wide range of applications. We improved the effectiveness of various program analysis techniques, sometimes by an order of magnitude, and made them deployable in the Guanyin distributed checking infrastructure. These results led to numerous publications at leading venues. The systems we built within the scope of this project have been incorporated into ConEd's electrical grid, Rudin Management's smart skyscrapers in New York City, and the Linux operating system kernel.
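
The idea of tracking application states to eliminate redundant tests can be illustrated with a minimal sketch. This is not the project's implementation: the names `fingerprint`, `RedundancyFilter`, and `run_checks` are hypothetical, and the sketch assumes that an application state can be summarized as a JSON-serializable dictionary. Each state is reduced to a digest, and a check runs only when that digest has not been seen before.

```python
import hashlib
import json


def fingerprint(state):
    """Return a stable digest of a JSON-serializable application state.

    Assumption: two states with equal content should count as the same
    state, so keys are sorted before hashing.
    """
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class RedundancyFilter:
    """Remembers which states were already checked (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def should_check(self, state):
        """Return True the first time a state is seen, False afterwards."""
        digest = fingerprint(state)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


def run_checks(states, check):
    """Run `check` once per distinct state.

    Returns the list of check results and the number of skipped repeats.
    """
    state_filter = RedundancyFilter()
    results = []
    skipped = 0
    for state in states:
        if state_filter.should_check(state):
            results.append(check(state))
        else:
            skipped += 1
    return results, skipped
```

In a deployed setting the set of seen digests would presumably need to be shared across hosts; the in-memory set here only keeps the sketch self-contained.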

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0905246
Program Officer: M. Mimi McClure
Project Start:
Project End:
Budget Start: 2009-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $1,028,000
Indirect Cost:
Name: Columbia University
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10027