Data provenance involves determining the conditions under which information was originally generated, as well as all subsequent modifications to that information and the conditions under which those modifications were performed. As systems become increasingly distributed and organizations increasingly rely on cloud computing to process their data, the need to securely manage and validate the provenance of that data becomes critical.
This project develops new frameworks for evaluating secure, fine-grained provenance collection and management on hosts. The research activities examine the architectures and algorithms required to make the collection and management of trustworthy provenance feasible at scale. We provide a general architecture for collecting provenance at the kernel level and consider methods of reducing the amount of provenance generated and managed while maintaining high-fidelity provenance records.
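As a rough illustration of this kernel-level collection model, the sketch below shows one way a provenance record and the hook that emits it might look. It is a minimal, hypothetical userspace simulation: the record fields, the hook name prov_record_event, and the printed output are illustrative assumptions, not the project's actual in-kernel interfaces.

```c
/* Illustrative sketch only: a simplified userspace model of the kind of
 * provenance record a kernel-level collection layer might emit. */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* A minimal provenance record tying a subject (process) to an object (file). */
struct prov_record {
    uint64_t    id;        /* unique record identifier           */
    unsigned    subject;   /* acting process (e.g., PID)         */
    unsigned    object;    /* target object (e.g., inode number) */
    const char *action;    /* operation performed                */
    time_t      timestamp; /* when the operation was mediated    */
};

/* Hypothetical hook: invoked wherever the kernel mediates an operation. */
static void prov_record_event(unsigned subject, unsigned object, const char *action)
{
    static uint64_t next_id = 1;
    struct prov_record rec = {
        .id        = next_id++,
        .subject   = subject,
        .object    = object,
        .action    = action,
        .timestamp = time(NULL),
    };
    /* A real collector would append this to an in-kernel log buffer;
     * here we simply print it. */
    printf("prov[%llu]: subject=%u %s object=%u\n",
           (unsigned long long)rec.id, rec.subject, rec.action, rec.object);
}

int main(void)
{
    /* Simulated mediation of two file operations performed by process 4242. */
    prov_record_event(4242, 1001, "read");
    prov_record_event(4242, 1002, "write");
    return 0;
}
```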
The project introduces a number of optimizations to enable scalable, performant provenance collection, including policy-based log reduction and provenance deduplication. The project's main scientific contributions include (1) the development of Linux Provenance Modules, which provide complete provenance mediation within the Linux kernel; (2) the design of policy-reduced provenance, which uses mandatory access control policies to restrict provenance-generating subjects and actions to those of interest in a system; and (3) techniques for deduplicating provenance so that minimal records of commonly occurring events can be stored and later fully reconstructed. We will demonstrate that our approach makes provenance collection practical at scale, enabling more secure and trustworthy computing environments.
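The following sketch illustrates, in simplified userspace form, how policy-based reduction and deduplication might interact: events from subjects outside a policy of interest generate no provenance at all, while repeated identical events collapse into a single counted record from which the original stream can be reconstructed. The policy_allows check, the fixed-size store, and the counting scheme are illustrative assumptions rather than the project's actual mechanisms.

```c
/* Illustrative sketch only: policy-based reduction and deduplication of
 * provenance events, simplified to a userspace demonstration. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct event { unsigned subject; unsigned object; const char *action; };

/* Policy-based reduction: only subjects of interest generate provenance.
 * This subject list stands in for a mandatory access control policy. */
static bool policy_allows(unsigned subject)
{
    static const unsigned subjects_of_interest[] = { 4242, 7000 };
    for (size_t i = 0; i < sizeof(subjects_of_interest) / sizeof(subjects_of_interest[0]); i++)
        if (subjects_of_interest[i] == subject)
            return true;
    return false;
}

/* Deduplication: an identical (subject, object, action) tuple is stored once
 * with a count, from which the repeated events can later be reconstructed. */
struct dedup_entry { struct event ev; unsigned count; };
static struct dedup_entry store[128];
static size_t store_len;

static void record_event(const struct event *ev)
{
    if (!policy_allows(ev->subject))
        return;                                   /* reduced away by policy */
    for (size_t i = 0; i < store_len; i++) {
        struct dedup_entry *e = &store[i];
        if (e->ev.subject == ev->subject && e->ev.object == ev->object &&
            strcmp(e->ev.action, ev->action) == 0) {
            e->count++;                           /* duplicate: bump count  */
            return;
        }
    }
    if (store_len < sizeof(store) / sizeof(store[0])) {
        store[store_len].ev = *ev;                /* first occurrence       */
        store[store_len].count = 1;
        store_len++;
    }
}

int main(void)
{
    const struct event evs[] = {
        { 4242, 1001, "read"  },
        { 4242, 1001, "read"  },   /* duplicate: deduplicated    */
        { 9999, 1001, "read"  },   /* not of interest: filtered  */
        { 4242, 1002, "write" },
    };
    for (size_t i = 0; i < sizeof(evs) / sizeof(evs[0]); i++)
        record_event(&evs[i]);

    for (size_t i = 0; i < store_len; i++)
        printf("subject=%u object=%u action=%s count=%u\n",
               store[i].ev.subject, store[i].ev.object,
               store[i].ev.action, store[i].count);
    return 0;
}
```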