Experts known as reverse engineers specialize in studying executable programs: software that comes without source code and is expressed as low-level machine instructions. This task is difficult because translating a program to the machine level throws away the information, provided by high-level programming languages like Java or C, that helps humans understand what the program does. This project develops analysis and machine learning techniques to augment and transform non-intuitive executable programs into more idiomatic and understandable software code. This work will help professional reverse engineers do their jobs more effectively; these professionals engage in security-critical and economically critical activities like understanding and responding to malware and viruses, discovering vulnerabilities, and fixing bugs in legacy software systems.

Reverse engineers specialize in reading and understanding a program's behavior from its executable in order to analyze malware, discover software vulnerabilities, or patch legacy bugs. Unfortunately, compilers discard considerable information that is key to human understanding: comments, names, user-defined datatypes, and idiomatic structure. State-of-the-art decompilation tools produce code that is largely not idiomatic and can be very difficult even for experts to understand. This project develops techniques that combine insight from program analysis and machine learning to construct models that automatically transform non-intuitive compiled code into more idiomatic and understandable code. In particular, these models (A) replace generic variable identifiers with more informative names; (B) reconstruct the names and structure of user-defined types; and (C) transform non-intuitive control flow into more idiomatic programming patterns. The project advances the state of the art both in language models (requiring novelties in, e.g., neural language models and tree-based machine learning) and in the program analysis and transformation used to constrain the search problems.
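
As a rough illustration of item (A) above, the following toy sketch applies a set of suggested names to decompiler-style C code. Everything in it is an assumption made for illustration: the code fragment, the function `rename_identifiers`, and the hand-written `SUGGESTED_NAMES` table, which merely stands in for the output of a learned model. It is not the project's method, only a picture of the kind of transformation described.

```python
import re

# Hypothetical decompiler-style output: generic names, pointer arithmetic.
DECOMPILED = """\
int sub_401000(int *a1, int a2) {
  int v1 = 0;
  int v2 = 0;
  while (v2 < a2) {
    v1 += *(a1 + v2);
    v2++;
  }
  return v1;
}
"""

# Hand-written stand-in for the names a learned model might suggest.
SUGGESTED_NAMES = {
    "sub_401000": "sum_array",
    "a1": "values",
    "a2": "count",
    "v1": "total",
    "v2": "index",
}


def rename_identifiers(code, names):
    """Replace whole-word identifiers in code according to the names mapping."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda match: names[match.group(1)], code)


renamed = rename_identifiers(DECOMPILED, SUGGESTED_NAMES)
print(renamed)
```

Even with better names, the loop still walks the array through pointer arithmetic; rewriting it as an indexed `for` loop would fall under item (C), and recovering a meaningful element type under item (B).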

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Budget Start: 2019-10-01
Budget End: 2022-09-30
Fiscal Year: 2019
Total Cost: $425,000
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213