For closed-source programs, such as most proprietary software and viruses, binary code analysis is indispensable for tasks such as vulnerability discovery and malware analysis. Some analysis techniques scale well but cannot accurately capture program semantics, while others are more accurate but limited in scalability. Improving both the accuracy and the scalability of binary code analysis remains an open problem. The objective of this research is to build novel binary code analysis approaches and techniques, based on recent advances in deep learning, that achieve both high accuracy and scalability. This project will not only advance cross-architecture binary code analysis but also propel its applications in vulnerability discovery, plagiarism detection, and malware understanding, especially in the context of heterogeneous IoT devices. Educational resources from this project will be disseminated through a dedicated website. This research will foster new research and education opportunities at the University of South Carolina. The outreach and educational activities, which engage students from Benedict College (an HBCU) in the research, will broaden the participation of underrepresented groups in computer security research.
This research emphasizes code-semantics-oriented learning by building deep-learning-based code analysis in a bottom-up manner, aiming to extract semantic information from binary code layer by layer. The technical aims of the project are divided into three thrusts. First, inspired by Neural Machine Translation (NMT), instructions and basic blocks are represented as embeddings (i.e., high-dimensional vectors), just as NMT represents words and sentences as points in high-dimensional spaces to facilitate further processing. Second, the code semantics captured at the instruction and basic-block layers are used for analysis at the control flow graph level. This layered learning process fits the hierarchy of semantics inherent in code, from instructions to basic blocks to control flow graphs. It thus minimizes the loss of semantic information while remaining scalable. Third, whether and how the proposed techniques can be extended to handle certain obfuscations will be investigated.
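As a rough illustration of the layered idea described above, the following minimal Python sketch composes instruction vectors into basic-block vectors and then into a single vector for a control flow graph. It is not the project's method: the hand-written 3-dimensional embedding table, the use of simple averaging in place of learned neural models, and every name in it (INSTRUCTION_EMBEDDINGS, embed_block, embed_cfg, cosine) are assumptions made purely for illustration.

```python
import math

# Hypothetical 3-dimensional instruction embeddings. In a real system these
# would be learned from a large corpus of disassembled code, not hand-written.
INSTRUCTION_EMBEDDINGS = {
    "mov": [0.9, 0.1, 0.0],
    "add": [0.1, 0.9, 0.0],
    "cmp": [0.0, 0.2, 0.8],
    "jmp": [0.0, 0.0, 1.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # fallback for opcodes missing from the table


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def embed_block(opcodes):
    """Basic-block layer: average the embeddings of the block's instructions."""
    return mean([INSTRUCTION_EMBEDDINGS.get(op, UNKNOWN) for op in opcodes])


def embed_cfg(blocks, edges):
    """Graph layer: mix each block with its successors, then average all blocks."""
    block_vecs = {name: embed_block(ops) for name, ops in blocks.items()}
    mixed = []
    for name, vec in block_vecs.items():
        successors = [block_vecs[dst] for src, dst in edges if src == name]
        mixed.append(mean([vec] + successors))
    return mean(mixed)


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# A toy function with three basic blocks and a self-loop.
blocks = {"entry": ["mov", "add"], "loop": ["add", "cmp", "jmp"], "exit": ["mov"]}
edges = [("entry", "loop"), ("loop", "loop"), ("loop", "exit")]
cfg_vector = embed_cfg(blocks, edges)
```

Two functions embedded this way could then be compared with cosine(), which is one plausible way to frame tasks such as similarity-based vulnerability search. Averaging discards instruction order and operand information, so this sketch only shows how the layers stack and says nothing about the accuracy a learned model would reach.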
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.