Software bugs have been reported to take lives and cost billions of dollars annually. Studies have shown that many bugs are "cloned" (i.e., copied-and-pasted) to many places. Unfortunately, existing error detection tools have not provided programmers the ability to efficiently search for bug clones. Thus, they have to resort to ad hoc manual approaches such as grepping the source tree for bug clones.
This project aims to improve software reliability and integrity through automatic detection and repair of bug clones given a newly discovered vulnerability. It will investigate a new dimension, code similarity, for detecting software bugs. Specifically, it will investigate the feasibility of an approach that derives bug "seeds" from a new bug patch or existing static or dynamic error detection tools, searches a large code base (potentially across administrative domains) for bug clones, and automatically protects the bug clones. This approach can detect bugs in cases where many existing techniques cannot due to code complexity: detecting similarity between code is easier than deconstructing its meaning.
If successful, this project will result in accurate tools that will help to detect and repair software vulnerabilities early. Programmers will use these tools to detect and repair bug clones whenever applicable. Improvements in the reliability and security of software on which business, government, and individuals depend on will positively impact society. This project will provide a more reliable and robust computing infrastructure resilient to new threats and attacks. Integrating the proposed research into the CS curriculum will as promote reliability and security awareness.
The use of symbolic execution to find bugs and show their absence has exploded over roughly the past decade. The technique allows deeper analysis than other widely-used approaches such as static analysis.One of the more interesting applications has been using symbolic execution to show code is correct. A key problem for doing so is that typically one does not know what a piece of code intends to do. The contribution of our work was to show that this problem could be side-stepped in realistic settings by cross-checking functions that aimed to implement the same interface. On legal inputs, they must do the same thing.We demonstrated that this approach worked, that it could find many bugs,and that the techniques we devised could show their absence. A key result is that the programmer merely needed to provide the tool with the two routines to cross check --- the tool would fabricate the inputs as needed and use a theorem prover to check that all escapings bits were identical. The programmer did not have to write a specification,translate their code to a weird theorem prover language, or manually devise proof steps in order to lead a theorem-prover to sucess. We applied the technique to two mature, widely-used open source library implementations, where it: Found numerous interesting errors. Verified the equivalence of 300 routines (150 distinct pairs) by exhausting all their paths up to a fixed input size (8 bytes). Got high statement coverage --- the lowest median coverage for any experiment was 90% and the rest were 100%.