In many Big Data applications today, such as Next-Generation Sequencing, data processing pipelines are highly complex, span multiple institutions, and include many human and computational steps. The pipelines evolve over time and vary across institutions, so it is difficult to track and reason about the processing pipelines to ensure consistency and correctness of results. Provenance-enabled scientific workflow systems promise to aid here, yet such workflow systems are often avoided due to perceptions of inflexibility, lack of good provenance analytics tools, and emphasis on supporting the data consumer rather than producer. We propose to better incentivize the adoption of workflow and other provenance tracking tools: (1) Instead of requiring a single workflow system across the entire pipeline, which can be inflexible, we allow for integration across multiple autonomous systems (provenance-enabled workflow systems, provenance tracking systems for languages like Python and R, etc.), and even across steps performed without any provenance tracking at all. (2) We develop provenance reasoning capabilities specifically useful to the data provider, such as provenance analytics across time, sites, and users; finding the code modules that best explain why two results are different; regression testing to determine whether a code change would affect prior results; and reconstructing missing provenance for steps that were not captured. These capabilities are expected to lead to wider tracking of data provenance, and ultimately to more consistent, reproducible, and reliable science. We will validate this hypothesis through the evaluation of our technologies within a Next-Generation Sequencing pipeline run by one of the PIs with collaborators at other institutions.

Public Health Relevance

Settings like Next-Generation Sequencing have very complex data processing pipelines that change over time, making it difficult to reason about data quality and consistency. Data provenance tools promise to help in this respect, but are often viewed as burdensome and oriented towards the data consumer rather than the producer. To incentivize adoption of provenance tracking and reasoning, we (1) make it lighter-weight to record and reconstruct the provenance of results in the data pipeline, and (2) provide analytics and debugging tools over data provenance that help the data provider reconstruct missing provenance, understand changes, and troubleshoot unexpected differences in results.

Agency
National Institutes of Health (NIH)
Institute
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
Type
Research Project--Cooperative Agreements (U01)
Project #
1U01EB020954-01
Application #
8876037
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Pai, Vinay Manjunath
Project Start
2015-06-01
Project End
2018-05-31
Budget Start
2015-06-01
Budget End
2016-05-31
Support Year
1
Fiscal Year
2015
Total Cost
Indirect Cost
Name
University of Pennsylvania
Department
Biostatistics & Other Math Sci
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
042250712
City
Philadelphia
State
PA
Country
United States
Zip Code
19104
Mamouras, Konstantinos; Raghothaman, Mukund; Alur, Rajeev et al. (2017) StreamQRE: Modular Specification and Efficient Evaluation of Quantitative Queries over Streaming Data. Proc ACM SIGPLAN Conf Program Lang Des Implement 52:693-708
Alawini, Abdussalam; Chen, Leshang; Davidson, Susan B et al. (2017) Automating data citation: the eagle-i experience. Proc ACM/IEEE Joint Conf Digit Libr 2017:
Davidson, Susan B; Buneman, Peter; Deutch, Daniel et al. (2017) Data Citation: a Computational Challenge. Proc ACM SIGACT SIGMOD SIGART Symp Princ Database Syst 2017:1-4
Liu, Mengmeng; Ives, Zachary G; Loo, Boon Thau (2016) Enabling Incremental Query Re-Optimization. Proc ACM SIGMOD Int Conf Manag Data 2016:1705-1720
Buneman, Peter; Davidson, Susan; Frew, James (2016) Why Data Citation Is a Computational Problem. Commun ACM 59:50-57
Wiener, Martin; Sommer, Friedrich T; Ives, Zachary G et al. (2016) Enabling an Open Data Ecosystem for the Neurosciences. Neuron 92:617-621
Wiener, Martin; Sommer, Friedrich T; Ives, Zachary G et al. (2016) Enabling an Open Data Ecosystem for the Neurosciences. Neuron 92:929
Ainy, Eleanor; Bourhis, Pierre; Davidson, Susan B et al. (2016) PROX: Approximated Summarization of Data Provenance. Adv Database Technol 2016:620-623