SBIR Phase I: Source-code recovery from machine code for security analysis and enforcement

Kotha, Aparna

Abstract

The innovation in this project is software that recovers equivalent C source-code programs from commercial off-the-shelf machine-code programs compiled from any programming language, and then analyzes the code for security purposes. Optional run-time checks for security enforcement can be injected into the output code. The output source code is functional: it can be modified, recompiled, and executed as required. Because of extensive executable analyses, the recovered source code is readily comprehensible, with features such as symbols, types, functions, arguments, return values, and control-flow constructs. Alternatively, the mechanism can recover the intermediate representation of a well-known open-source compiler, allowing machine code to be analyzed with source-code compiler methods. This is a significant advancement in bridging the gap between machine-code and source-code analysis. The current prototype has been successfully evaluated with executables compiled from over two million lines of source code. Additional research is being conducted in two directions. First, methods are being devised to detect interesting features in malicious software, such as the underlying communication mechanism, input/output channels, and information flow. These methods are enhanced by innovations in analyzing memory locations in machine code, rather than just registers, yielding greater analysis precision. Second, several hybrid methods are being explored to statically analyze obfuscated executables, optionally aided by dynamic information.

The broader/commercial impact of the innovation is a dramatic improvement in the speed, efficiency, and efficacy of countering cyber threats, bringing a game-changing capability in cybersecurity to both desktop and mobile platforms. President Obama recently cited cyber threats as one of our most serious economic and national security challenges. Cyber crime costs the US economy billions of dollars and poses a direct threat to our national infrastructure and financial institutions. Theft of intellectual property alone costs American companies around $250 billion per year. The innovation has the potential to enable orders-of-magnitude productivity improvements across the cybersecurity spectrum, including malware analysis, exposing undesirable behavior in untrusted code, detecting vulnerabilities in proprietary software, and enforcing security. The mechanism being developed yields precise discovery of features and robust defense measures against these threats. It also enables modification and maintenance of legacy software whose source code has been lost. Consequently, the mechanism enables substantially faster, automated, and more detailed analysis of cyber threats, resulting in a more robust defense capability. This ability directly contributes to minimizing losses to the US economy. Better protection of our IP and trade secrets also contributes to minimizing American job losses.

Project Report

Today, computer software is a critical component of every industry and enterprise. According to a recent report by Gartner, an information technology research and advisory firm, worldwide enterprise software spending is expected to reach $296 billion in 2013, amounting to 32% of IT services spending and 8% of total IT investments, with an annual growth rate of 6.4%. The report also cites security and big data quality management (performance) as key drivers. Enterprises therefore require tools to identify and fix performance issues arising in their systems, along with security tools to protect those systems from hackers. Software is broadly delivered in two flavors: binary code and managed code. Both kinds of software are typically developed as source code in programming languages such as C, C++, Java, and the .NET languages. In the case of managed code, source code written in a language such as Java is translated to bytecode by a compiler. This bytecode is executed inside another piece of software called an interpreter. By contrast, source code written in languages such as C and C++ is translated by a compiler into binary code. Binary code executes directly on the computer and is interchangeably referred to as machine code, native code, or executable code. Although many enterprise applications are written in managed code, binary code applications are prevalent in several important components of the modern application software stack. Performance-critical and legacy applications are often delivered as binary code, and are employed very frequently in the finance, telecommunications, insurance, defense, and aerospace sectors. Backbone components of modern software, such as database servers, web servers, messaging servers, and virtual machines, are nearly all delivered as binary code. In addition, nearly all malicious applications, such as computer viruses and worms, are binary code.
Developing software in the form of binary code poses several challenges in meeting the required goals of performance and security. Software developers keep their source code as their own intellectual property (IP). The binary code representation available to end users can only be executed on machines and is not comprehensible to humans. It is also extremely hard, even for a tool, to understand and manipulate. SecondWrite LLC is building software solutions that will help enterprise IT teams meet their performance and security requirements for binary code software. The innovative software tools are based on our underlying patent-pending technology, called binary rewriting, which updates and modifies binary code without access to source code. Specifically, the outcomes of the Phase I project are the following:
We identified that low-overhead binary rewriting can be applied to solve an important challenge in application performance monitoring (APM). APM is a $2.1 billion industry that provides solutions for identifying performance bottlenecks in user applications. However, all existing vendors only provide solutions for monitoring managed code applications, such as Java or .NET applications. We are addressing this gap by providing a novel capability to monitor binary code directly. We have made significant technical and business progress with a leading APM vendor. We are in the final stages of negotiating an OEM deal under which our tool will be available as part of their overall application monitoring framework. On the technical front, we have co-developed a framework to directly monitor binary code applications. This novel tool will be made available to a set of customers for alpha prototype testing in Fall 2014.
Our market discovery in the cybersecurity space revealed an important limitation of existing security solutions that aim to detect malware. Malware is software written with the malicious intent of causing disruption, or of stealing information, money, or identities without the consent of the owner or subject of the data. Several kinds of cybersecurity solutions execute suspicious programs inside an isolated environment to uncover their behavior. Our market discovery revealed that such solutions fail to uncover the behavior of evasive malware, a modern, sophisticated class of malware that deliberately hides its behavior. We have designed a system to automatically detect and stop infection by such evasive malware. We developed a prototype tool that recovers a high-level abstraction from potential malware for automatic identification of malicious features such as keystroke logging and the presence of encryption algorithms. This automatic discovery will enable incident response teams at enterprises to reason about malware without resorting to complex manual reverse engineering. Based on a demonstration of an early prototype of our tool, our potential customers recommended extending it to automatically discover a more diverse set of features, including mouse-click logging and network calls.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Industrial Innovation and Partnerships (IIP)
Type: Standard Grant (Standard)
Application #: 1315099
Program Officer: Peter Atherton

Budget Start: 2013-07-01
Budget End: 2014-06-30
Fiscal Year: 2013
Total Cost: $150,000

Secondwrite, Bethesda, MD, United States
