A study of software code has revealed a surprising result: software code may be just as "natural" as natural language itself (e.g., English), if not more so, in that code is highly predictable and repetitive, so statistical natural language techniques can be applied quite effectively to some software engineering tasks. For example, N-gram models may be quite effective at suggestion and completion tasks in code. This evidence supports further exploration of the applicability of statistical NLP techniques and tools to software development activities and processes. The project explores the feasibility of establishing a scientific basis and tools for a variety of code-level software engineering functions to support software engineering -- including natural language summarization, code retrieval, software question answering, automated code completion, and assistive tools for disabled developers -- forming not only a new and important domain for further research in NLP, but also an entirely new approach to software development.
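To make the completion idea concrete, the following is a minimal Python sketch of how an n-gram model trained on source-code tokens can rank likely next tokens. The toy corpus, the tokenization, and the absence of smoothing are illustrative assumptions for this example, not the models used in the studies described above.

from collections import Counter, defaultdict

def train_trigram_model(token_lists):
    # Count, for every two-token context, how often each next token follows it.
    counts = defaultdict(Counter)
    for tokens in token_lists:
        padded = ["<s>", "<s>"] + tokens
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            counts[(a, b)][c] += 1
    return counts

def suggest(counts, context, k=3):
    # Return the k most frequent continuations of the last two tokens.
    a, b = context[-2:]
    return [tok for tok, _ in counts[(a, b)].most_common(k)]

# Toy corpus of tokenized Java-like statements (invented for this example).
corpus = [
    ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "i", "++", ")"],
    ["for", "(", "int", "j", "=", "0", ";", "j", "<", "m", ";", "j", "++", ")"],
]
model = train_trigram_model(corpus)
print(suggest(model, ["for", "("]))  # -> ['int']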
Statistical "big data" approaches have been used extensively for analyzing human languages, but have been used less extensively for analyzing programming languages, i.e., the artificial languages that have been designed for humans to communicate with computers. This may be because artificial languages are, by design, amenable to more traditional methods of analysis: however, such traditional analyses do not model the statistical regularities in how computer languages are be used (as opposed to the artificial constraints on how they can be used). In this project, we combined traditional software analysis with statistical natural language analysis, and used these hybrid methods to solve a number of specific problems. One problem was modeling the natural-language comments that are associated with code: we devised new statistical approaches which model both the statistical regularities in the comment text, and the statistical connections between that comment text and the code that it describes. We then used these models to build a smart editor that can auto-complete comment text very effectively, saving about half the typing a programmer would need to do in entering comments. The accompanying image shows how some of the models we developed worked on a small sample of text. The second problem we addressed with hybrid methods was assessing the semantic similarity of software modules: we showed that one important semantic relation ("coordinate terms") could be predicted much more accurately with hybrid methods than with either traditional or purely statistical approaches. The second accompanying image shows a graph of java classes, organized by this similarity relationship.