System intrusions have become more subtle and complex. Attackers now covertly observe and probe systems for prolonged periods before launching devastating attacks. In such an environment, it has grown prohibitively difficult for system administrators to identify suspicious events, correlate these events into an attack pattern, and determine an appropriate response. Data provenance is a method of modeling a system's execution in the form of a causal relationship graph, allowing investigators to trace the ancestry of data objects and identify relationships between seemingly independent events. The goal of the proposed work is to develop techniques that enable the use of data provenance as an expressive and efficient monitoring tool in large distributed systems. These mechanisms will provide an unprecedented capability to reason about system events, centrally monitor activities within data centers, and express fine-grained enforcement of security properties based on the historical flow of data. Research and software artifacts will be made available to the broader community through the Linux provenance web site.
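As a rough illustration of the causal relationship graph described above, the following minimal sketch stores "effect was derived from cause" edges and walks them backward to recover the ancestry of a data object. It is not the project's implementation: the `ProvenanceGraph` class, its `record` and `ancestry` methods, and the node labels are all hypothetical names chosen for this example.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Toy causal graph: each edge points from an effect back to one of its causes.

    Hypothetical structure for illustration only, not the project's data model.
    """

    def __init__(self):
        # Maps a node to the set of nodes that directly influenced it.
        self._causes = defaultdict(set)

    def record(self, effect, cause):
        """Record that `effect` was causally derived from `cause`."""
        self._causes[effect].add(cause)

    def ancestry(self, node):
        """Return every node that transitively influenced `node`."""
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries for nodes with no causes.
            for cause in self._causes.get(current, ()):
                if cause not in seen:
                    seen.add(cause)
                    queue.append(cause)
        return seen


# Made-up events: a remote login that eventually writes a file,
# plus an unrelated scheduled job on the same host.
graph = ProvenanceGraph()
graph.record("process:sshd", "socket:203.0.113.5:22")
graph.record("process:bash", "process:sshd")
graph.record("file:/tmp/payload", "process:bash")
graph.record("process:cron", "file:/etc/crontab")
graph.record("file:/var/log/cron.log", "process:cron")

ancestors = graph.ancestry("file:/tmp/payload")
print(sorted(ancestors))
```

In this sketch the backward traversal ties the written file to the shell, the login daemon, and the remote socket, while the unrelated scheduled job stays out of the result.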

The proposed work will examine central challenges related to expressivity and scalability that currently prevent the further proliferation of provenance-based auditing techniques. To address the semantic gap that has traditionally prevented system-layer auditing from explaining higher-level application behaviors, this project pursues the design of universal provenance mechanisms that leverage binary analysis to transparently identify siloed application-layer logging activities, extract their semantics, and graft the information onto a causal relationship graph that encodes the entire system's execution. Grammar induction techniques will be leveraged to overcome the tremendous storage burden of provenance and provide a scalable central monitoring framework for data centers. After enriching system-layer auditing and enabling the efficient communication of suspicious activities via provenance traces, data provenance will be integrated into enforcement mechanisms to address critical security challenges including regulatory compliance, information flow control, and fault attribution. Advancing the state of the art in provenance-based tracing and enforcement should establish a new baseline for reasoning about the flow of data in today's complex computing systems.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 1750024
Program Officer: Phillip Regalia
Project Start:
Project End:
Budget Start: 2018-04-01
Budget End: 2023-03-31
Support Year:
Fiscal Year: 2017
Total Cost: $305,833
Indirect Cost:
Name: University of Illinois Urbana-Champaign
Department:
Type:
DUNS #:
City: Champaign
State: IL
Country: United States
Zip Code: 61820