Software is pervasive. Software typically contains a large volume of artifacts written in natural languages (NL), including code comments, change logs, manual pages, constant strings in code, and variable and function names. Software NL artifacts contain a wealth of semantic information that is often missing in code artifacts. There has been substantial existing work on analyzing NL artifacts and leveraging them in a wide range of software-engineering applications. However, most existing work is ad hoc and limited in generality. Existing work typically considers NL artifacts as sources of additional information instead of first-class objects on which analysis operates (like variable types in program analysis), missing the opportunity to take full advantage of software NL artifacts. Thus, this project develops co-analysis of code and NL artifacts, which treats NL artifacts as first-class objects. In addition to advancing the state of the art, the principles, infrastructure, and techniques developed in the project are transformative, providing educational and practical tools to generate high-quality source code and software documents. These techniques improve program analysis, software maintenance, software reliability, and engineering productivity, lowering software development cost and improving the work and recreational activities in which software is indispensable.
The project develops a principled and sophisticated software reasoning method that couples NL analysis and program analysis. It automatically models and classifies various kinds of NL artifacts, and attributes them to the related code elements. As such, NL artifacts become first-class objects, just like the classic objects of program analysis (e.g., variables and statements). They can be inferred, propagated, updated, associated, and formally reasoned about, to maximize the utilization of their rich semantics (e.g., through program analysis, comments can be propagated to code elements that were not previously commented). The project activities include (1) modeling, classifying, and attributing NL artifacts, by developing domain-specific language models that process, model, and classify NL artifacts and attribute them to the corresponding code elements, (2) building uniform representation, propagation, and co-reasoning of NL artifacts and code artifacts, (3) producing highly accurate and scalable probabilistic inference, by leveraging probabilistic graphical models to perform uniform reasoning over both code and NL artifacts, and (4) exploring new applications of co-analysis in domains including software testing.
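As an illustration only, the idea of propagating comments to uncommented code elements can be sketched as a small worklist algorithm over data-flow edges. Everything in this sketch is an assumption made for the example and is not taken from the project: the function name `propagate_comments`, the `decay` and `threshold` parameters, the edge map, and the variable names. A fixed confidence decay stands in for the probabilistic inference the project describes.

```python
from collections import deque


def propagate_comments(edges, comments, decay=0.5, threshold=0.2):
    """Spread comment annotations along data-flow edges (hypothetical sketch).

    edges: dict mapping a code element to the elements its value flows into.
    comments: dict mapping a commented element to its comment text.
    Returns a dict mapping each annotated element to (comment text, confidence).
    """
    # Elements commented by the developer start with full confidence.
    annotations = {elem: (text, 1.0) for elem, text in comments.items()}
    worklist = deque(comments)
    while worklist:
        src = worklist.popleft()
        text, conf = annotations[src]
        new_conf = conf * decay  # each hop makes the annotation less certain
        if new_conf < threshold:
            continue  # too uncertain to propagate any further
        for dst in edges.get(src, ()):
            # Keep only the highest-confidence annotation seen for an element.
            if dst not in annotations or annotations[dst][1] < new_conf:
                annotations[dst] = (text, new_conf)
                worklist.append(dst)
    return annotations


# Hypothetical data flow: timeout_ms -> deadline -> retry_after -> sleep_arg
edges = {
    "timeout_ms": ["deadline"],
    "deadline": ["retry_after"],
    "retry_after": ["sleep_arg"],
}
comments = {"timeout_ms": "must be non-negative"}
result = propagate_comments(edges, comments)
```

In this toy run the comment reaches `deadline` and `retry_after` with decreasing confidence and stops before `sleep_arg`, where the confidence would fall below the threshold. A realistic co-analysis would replace the fixed decay with inference over a probabilistic model of both code and NL artifacts.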
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.