Software is pervasive, and it typically contains a large volume of artifacts written in natural language (NL), including code comments, change logs, manual pages, constant strings in code, and variable and function names. These NL artifacts carry a wealth of semantic information that is often missing from the code itself. Substantial prior work has analyzed NL artifacts and leveraged them in a wide range of software-engineering applications. However, most of that work is ad hoc and limited in generality: it typically treats NL artifacts as sources of additional information rather than as first-class objects on which analysis operates (like variable types in program analysis), and so misses the opportunity to take full advantage of them. This project therefore develops co-analysis of code and NL artifacts, which treats NL artifacts as first-class objects. In addition to advancing the state of the art, the principles, infrastructure, and techniques developed in the project are transformative, providing educational and practical tools for generating high-quality source code and software documents. These techniques improve program analysis, software maintenance, software reliability, and engineering productivity, lowering software development cost and benefiting the work and recreational settings in which software is indispensable.

The project develops a principled software reasoning method that couples NL analysis with program analysis. It automatically models and classifies various kinds of NL artifacts and attributes them to the related code elements. The artifacts thereby become first-class objects, just like classic objects in program analysis (e.g., variables and statements): they can be inferred, propagated, updated, associated, and formally reasoned about, to make full use of their rich semantics (e.g., program analysis can propagate comments to code elements that were not previously commented). The project activities include (1) modeling, classifying, and attributing NL artifacts, by developing domain-specific language models that process, model, and classify NL artifacts and attribute them to the corresponding code elements; (2) building a uniform representation of NL artifacts and code artifacts that supports propagation and co-reasoning; (3) producing highly accurate and scalable probabilistic inference, by leveraging probabilistic graphical models to reason uniformly over both code and NL artifacts; and (4) exploring new applications of co-analysis in domains including software testing.
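
As a rough illustration of the comment-propagation idea mentioned above, the following minimal sketch pushes seed comments along directed flow edges to code elements that have no comment of their own. It is not the project's method: the function name `propagate_comments`, the edge representation, and the per-hop `decay` factor (a crude stand-in for the probabilistic inference the abstract describes) are all assumptions made for this example.

```python
from collections import deque


def propagate_comments(comments, edges, decay=0.5):
    """Propagate NL annotations along directed edges (hypothetical sketch).

    comments: dict mapping a code element to its comment text; each seed
              comment starts with confidence 1.0.
    edges:    iterable of (src, dst) pairs, e.g. "the value of src flows to dst".
    decay:    assumed factor in (0, 1] applied to the confidence at every hop.
    Returns a dict mapping each reached element to (comment, confidence).
    """
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)

    result = {elem: (text, 1.0) for elem, text in comments.items()}
    worklist = deque(comments)
    while worklist:
        elem = worklist.popleft()
        text, conf = result[elem]
        for nxt in successors.get(elem, []):
            new_conf = conf * decay
            # Keep only the highest-confidence annotation seen so far; the
            # strict comparison also guarantees termination on cyclic graphs.
            if nxt not in result or result[nxt][1] < new_conf:
                result[nxt] = (text, new_conf)
                worklist.append(nxt)
    return result


# Hypothetical example: one commented parameter whose value flows onward.
seed = {"timeout_ms": "timeout in milliseconds; must be non-negative"}
flow = [("timeout_ms", "t"), ("t", "deadline")]
annotations = propagate_comments(seed, flow)
for element, (text, confidence) in annotations.items():
    print(f"{element}: {text!r} (confidence {confidence})")
```

In this toy setting the uncommented elements `t` and `deadline` inherit the comment written for `timeout_ms`, with lower confidence the further they are from the original annotation. A real co-analysis would also have to decide which kinds of comments are valid to propagate across which kinds of program dependences.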

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 1901242
Program Officer: Sol Greenspan
Budget Start: 2019-07-15
Budget End: 2023-06-30
Fiscal Year: 2019
Total Cost: $692,005
Name: Purdue University
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907