For closed-source programs, such as most proprietary software and viruses, binary code analysis is indispensable for tasks such as vulnerability discovery and malware analysis. Some analysis techniques scale well but cannot accurately capture program semantics, while others are more accurate but scale poorly. How to improve both the accuracy and the scalability of binary code analysis remains an open problem. The objective of this research is to build novel binary code analysis approaches and techniques, based on recent advances in deep learning, that achieve both high accuracy and scalability. This project will not only advance cross-architecture binary code analysis but also propel its applications in vulnerability discovery, plagiarism detection, and malware understanding, especially in the context of heterogeneous IoT devices. Educational resources from this project will be disseminated through a dedicated website. This research will foster new research and education opportunities at the University of South Carolina. Outreach and educational activities that engage students from Benedict College (HBCU) in the research will broaden the participation of underrepresented groups in computer security research.

This research emphasizes semantics-oriented learning: it builds deep-learning-based code analysis bottom-up, extracting semantic information from binary code layer by layer. The technical aims of the project are divided into three thrusts. First, inspired by Neural Machine Translation (NMT), instructions and basic blocks are represented as embeddings (i.e., high-dimensional vectors), much as NMT represents words and sentences as points in a high-dimensional space to facilitate further processing. Second, the code semantics captured at the instruction and basic-block layers are used for analysis at the control flow graph level. This layered learning process fits the hierarchy of semantics inherent in code, from instructions to basic blocks to control flow graphs. It thus minimizes the loss of semantic information while remaining scalable. Third, the project will investigate whether and how the proposed techniques can be extended to handle certain obfuscations.
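
The bottom-up layering described above can be illustrated with a toy sketch. It is not the project's method: the hash-based token vectors, the mean pooling, and the single round of successor averaging are assumptions standing in for learned models, and every name (`token_vector`, `embed_instruction`, `embed_block`, `embed_cfg`) is hypothetical.

```python
import hashlib

DIM = 8  # toy embedding size (assumption); real models use far more dimensions


def token_vector(token: str) -> list[float]:
    """Deterministic pseudo-embedding for one token (stand-in for a learned table)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 255.0 - 0.5 for i in range(DIM)]


def mean(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def embed_instruction(instruction: str) -> list[float]:
    """Layer 1: an instruction is the mean of its opcode/operand token vectors."""
    tokens = instruction.replace(",", " ").split()
    return mean([token_vector(t) for t in tokens])


def embed_block(instructions: list[str]) -> list[float]:
    """Layer 2: a basic block is the mean of its instruction embeddings."""
    return mean([embed_instruction(i) for i in instructions])


def embed_cfg(blocks: dict[str, list[str]], edges: list[tuple[str, str]]) -> list[float]:
    """Layer 3: mix each block with its successors once, then pool over the graph."""
    base = {name: embed_block(insns) for name, insns in blocks.items()}
    mixed = []
    for name in blocks:
        succs = [base[dst] for src, dst in edges if src == name]
        mixed.append(mean([base[name]] + succs))
    return mean(mixed)


# Hypothetical three-block control flow graph with x86-style instructions.
blocks = {
    "entry": ["mov eax, 1", "cmp eax, ebx", "jne exit"],
    "body": ["add eax, ebx"],
    "exit": ["ret"],
}
edges = [("entry", "body"), ("entry", "exit"), ("body", "exit")]
cfg_vector = embed_cfg(blocks, edges)
print(len(cfg_vector))
```

Each layer consumes only the output of the layer below it, so instruction-level information is carried upward rather than recomputed, which is the property the layered design relies on for scalability.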

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1953073
Program Officer: Sol Greenspan
Project Start:
Project End:
Budget Start: 2020-10-01
Budget End: 2023-09-30
Support Year:
Fiscal Year: 2019
Total Cost: $416,947
Indirect Cost:
Name: University of South Carolina at Columbia
Department:
Type:
DUNS #:
City: Columbia
State: SC
Country: United States
Zip Code: 29208