Data provenance refers to the history of the contents of an object and its successive transformations. Knowledge of data provenance serves many ends, such as enhancing data trustworthiness, facilitating accountability, verifying compliance, aiding forensics, and enabling more effective access and usage controls. To realize these benefits, provenance data needs, at a minimum, integrity assurance. Additionally, provenance data may need assurances of confidentiality (e.g., protecting the identity of a reviewer in a blinded paper review process from the authors but not from the editor) or of privacy (e.g., not disclosing the identity of a source without the source's consent). In the past decade there has been significant progress regarding the structure and representation of provenance data as a directed acyclic graph. However, there is currently no overarching, systematic framework for the security and privacy of provenance data and their tradeoffs with the utility of provenance data. The development of such a framework is recognized as one of a handful of promising thrusts in recent reports on Federal game-changing R&D for cyber security, particularly aligned with the theme of Tailored Trustworthy Spaces.
This project aims to develop a comprehensive technical and scientific framework to address the security and privacy challenges of provenance data, and the attendant tradeoffs, so that society can gain maximum benefit from applications of provenance data. Detailed foundational research is to be performed on security-enhanced data models, access control and usage models, privacy (including anonymization and sanitization), integrity, accountability, and risk management techniques for provenance data. This foundational research is complemented by data provenance case studies in scientific and cyber security information sharing, and by the construction of prototype data provenance systems at the operating system and data layers. Moreover, reference architectures and corresponding provenance management services are to be defined, identifying how these services can be effectively deployed in enterprises, and a risk-management framework is to be developed to guide application architects, designers, and users in effectively embedding data provenance in their specific contexts. The project results will benefit society at large by increasing the trustworthiness of data acquired, transmitted, and processed by computer systems. On the educational side, both the theory and practice of data provenance are to be integrated into the undergraduate and graduate training of students, including underrepresented minority and female students, at all the collaborating institutions of this project.