As scientists gain access to data sets accompanied by automatically generated provenance records, they face the challenge of integrating and analyzing this metadata. Independent sources are likely to have captured provenance at distinct levels of abstraction and with different levels of completeness, used separate sets of identifiers to refer to the same artifacts, processes, and agents, and introduced dissimilar semantics in the annotations.

This research studies the problem of semi-automatically integrating and analyzing the provenance of scientific data that originates from diverse sources, with independent annotation schemas, semantics that may overlap only partially, representations at different granularities, and incomplete characterizations of the activity being recorded. In particular, it (i) develops a formal framework for combining provenance, (ii) provides an extensible software system for provenance ingestion, integration, and analysis, and (iii) creates canonical provenance data sets of various sizes, granularities, and domains that can be used to compare provenance integration and analysis algorithms.

Maintaining a record of all the transformations that data undergoes becomes increasingly critical as analyses grow longer and as the data's sources grow older and more diverse. Such provenance metadata can answer a range of queries. For example, in situations where only derivative data is preserved, a provenance record can help validate claims about the procedures used to obtain the final results. Concerns about whether privacy-sensitive data (such as information from patient records) has been used in contravention of legal or security policies can be addressed by checking the provenance records for violations.
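As an illustration of the policy check described above, the following is a minimal sketch, not the project's system. It assumes a provenance record can be reduced to a mapping from each artifact to its direct inputs; the artifact names, the `derived_from` mapping, and the `ancestors` helper are all hypothetical.

```python
def ancestors(derived_from, artifact):
    """Return every artifact that `artifact` transitively derives from."""
    seen = set()
    stack = [artifact]
    while stack:
        current = stack.pop()
        # Artifacts with no recorded inputs are treated as original sources.
        for parent in derived_from.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical provenance record: each artifact maps to its direct inputs.
derived_from = {
    "final_report.pdf": ["summary_stats.csv"],
    "summary_stats.csv": ["cleaned_cohort.csv", "reference_ranges.csv"],
    "cleaned_cohort.csv": ["patient_records.db"],
}

# Hypothetical policy: these sources must not contribute to published results.
sensitive = {"patient_records.db"}

# A violation is any sensitive source found among the report's ancestors.
violations = ancestors(derived_from, "final_report.pdf") & sensitive
print(sorted(violations))
```

In this sketch the check reports `patient_records.db`, because the report depends on it through two intermediate files. A real record would also need the identifier reconciliation discussed earlier before such a traversal could be trusted across sources.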

SRI International
Menlo Park
United States