Data provenance is a fundamental issue in the processing of scientific information and beyond. Two lines of research have been pursued in recent years with direct bearing on the issues of data provenance. In one of them, provenance in workflows, the emphasis is on extracting provenance from logs of events marking the execution of different modules over various initial and derived datasets. In the other line of research, provenance in databases, the emphasis is on the propagation of provenance through the operators that make up database views, or on the propagation of provenance through copy/cut-and-paste operations within and among databases. These two bodies of work employ different techniques, and at first glance their results appear quite different. However, in many scientific applications, database manipulations co-exist with the execution of workflow modules, and the provenance of the resulting data should integrate both kinds of processing into a usable paradigm.
By analyzing the work on data provenance in workflows and in databases, the PIs identify what they believe are the main difficulties in unifying and integrating these two different kinds of data provenance: (1) the lack of a data model that is rich enough to capture the interaction between the structure of the data and the structure of the workflow; and (2) the lack of a high-level specification framework in which database operators and workflow modules can be treated uniformly.
In this project, the PIs aim to overcome these difficulties and thus provide concepts and tools that allow a truly comprehensive approach to the provenance of scientific data. The project's approach relies on a data model that supports nested collections and on a functional language approach to workflow specification. Based on this, the project aims to deliver a framework and tools for defining, managing and querying data provenance in complex scientific workflows that include database manipulations. The project is expected to impact bioinformatics (through interdisciplinary collaborations in the Penn Center for Bioinformatics and the Penn Genome Frontiers Institute) and phyloinformatics (through contributions to the NSF AToL program) as well as ongoing standardization work on provenance in workflows and in the business process (e.g., BPEL) community.
The results of this project are disseminated as publications, through direct collaborations and through the project website: http://db.cis.upenn.edu/research/UNIPROVE.html.
Data provenance, the process of tracing and recording the origins of data and how it moves between programs and databases, is a fundamental issue in the processing of scientific information and beyond. It is important for the verifiability and repeatability of results, as well as for debugging and troubleshooting the process by which final results were obtained. Prior to the work of this grant, two lines of research had been pursued with direct bearing on data provenance. In one of them, provenance in workflows, the emphasis has been on extracting provenance from logs of events marking the execution of different processing steps over various initial and derived datasets. In the other line of research, provenance in databases, the emphasis has been on the propagation of provenance through query operators, or on the propagation of provenance through copy/cut-and-paste operations within and among databases. These two bodies of work used different techniques, and at first glance their results appear quite different. However, in many scientific applications database manipulations co-exist with the execution of workflow modules, and the provenance of the resulting data should integrate both kinds of processing into a usable paradigm. The objective of this research was to provide a framework for integrating database and workflow provenance.

Results of this research have included fundamental contributions to the theory of provenance as well as practical tools. In particular, the project extended the theory of database "provenance semirings" to workflows in which the processing steps may be affected by what has happened in the past ("stateful" execution, e.g., steps that are guided by an underlying database or steps that represent active learning). These are complex applications whose control flow is guided by a finite state machine as well as by the state of an underlying database (data-dependent process models); examples of such applications include e-commerce and crowd-mining.

The project showed how questions such as "Identify the data sources that contributed some data leading to the production of publication p" (reachability queries) can be answered efficiently using workflow provenance, as well as more complex questions such as "Find all publications p that resulted from starting with data of type x, then performing a repeated analysis using either technique a1 or technique a2, terminated by producing a result of type s, and eventually ending by publishing p" (regular path queries). It also showed how provenance support can be used by analysts to interactively test and explore the effect of hypothetical modifications to the logic of an application and/or to the underlying database.

Since the provenance generated by complex applications can be overwhelmingly large, the project also explored techniques for reducing the size of the provenance shown to users in response to queries. These techniques included providing "views" of provenance, i.e., personalizations of provenance according to user interest and/or authority (access control), as well as summarizations of provenance ("approximate" provenance).
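To make the provenance-semiring idea mentioned above concrete, the following is a minimal illustrative sketch (not the project's actual tooling) of how polynomial provenance annotations propagate through two relational operators: union combines alternative derivations with "+", while join combines jointly used inputs with "*". The relation contents and the provenance tokens x1, x2, x3, y1 are invented for illustration.

    # Minimal sketch of provenance-semiring propagation, assuming the
    # polynomial semiring N[X]: "+" for alternative derivations (union),
    # "*" for joint use of inputs (join). Annotated relations are
    # dicts mapping tuples to provenance expressions (here, strings).

    def punion(r, s):
        """Union: a tuple derivable from either input gets provenance p + q."""
        out = dict(r)
        for t, q in s.items():
            out[t] = f"({out[t]} + {q})" if t in out else q
        return out

    def pjoin(r, s):
        """Join on the first attribute: provenance of joined tuples multiplies."""
        out = {}
        for (a, b), p in r.items():
            for (a2, c), q in s.items():
                if a == a2:
                    out[(a, b, c)] = f"{p} * {q}"
        return out

    # Invented source relations; x1, x2, y1 identify source tuples.
    R = {("gene1", "sampleA"): "x1", ("gene2", "sampleA"): "x2"}
    S = {("gene1", "assay9"): "y1"}

    print(pjoin(R, S))
    # {('gene1', 'sampleA', 'assay9'): 'x1 * y1'}
    print(punion(R, {("gene1", "sampleA"): "x3"}))
    # {('gene1', 'sampleA'): '(x1 + x3)', ('gene2', 'sampleA'): 'x2'}

Reading the resulting polynomials back tells an analyst exactly which source tuples were used, and in which combinations, to derive each output tuple.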
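The regular path queries described above can likewise be illustrated with a small sketch. One standard way to evaluate such a query (used here purely as an illustration, not as the project's algorithm) is to search the product of the provenance graph with a finite automaton for the pattern. The automaton below encodes the example pattern "x (a1|a2)* s p"; the graph, its node names, and its edge labels are invented for illustration.

    # Sketch: regular path query over an edge-labeled workflow
    # provenance graph, via breadth-first search over (node, state)
    # pairs in the product with a hand-coded DFA for x (a1|a2)* s p.
    from collections import deque

    # Invented provenance graph: (source node, edge label, target node).
    edges = [
        ("d0", "x", "d1"),
        ("d1", "a1", "d2"), ("d2", "a2", "d3"), ("d3", "a1", "d4"),
        ("d4", "s", "d5"), ("d5", "p", "pub1"),
    ]

    # DFA for x (a1|a2)* s p: state -> {label: next state}; 3 accepts.
    dfa = {0: {"x": 1}, 1: {"a1": 1, "a2": 1, "s": 2}, 2: {"p": 3}}
    ACCEPT = 3

    def rpq(start):
        """Nodes reachable from `start` along a label path matching the DFA."""
        adj = {}
        for u, label, v in edges:
            adj.setdefault(u, []).append((label, v))
        seen, answers = {(start, 0)}, set()
        queue = deque(seen)
        while queue:
            node, state = queue.popleft()
            if state == ACCEPT:
                answers.add(node)
            for label, nxt in adj.get(node, []):
                step = dfa.get(state, {}).get(label)
                if step is not None and (nxt, step) not in seen:
                    seen.add((nxt, step))
                    queue.append((nxt, step))
        return answers

    print(rpq("d0"))  # {'pub1'}

A plain reachability query ("which publications are downstream of this source?") is the special case in which the automaton accepts any label sequence.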