Tracing the lineage of scientific data and assertions is critical for the checks and balances that once ensured scientific fidelity (Collins & Tabak, 2014). As data become pervasively digitized, generating and following lineages automatically and at scale increases the usefulness and quality of conclusions. A significant challenge in data-intensive science is generating that lineage, the provenance of scientific information, while facilitating retrieval and re-execution. We hypothesize that such capabilities improve the reproducibility of assertions and make data more useful to society. Our objective is to build application programming interfaces (APIs) for provenance, data integrity, storage, and reproducible workflows that empower researchers to record, retrieve, and re-run scientific lineages. The rationale for the proposed research is that the value of scientific data is enhanced by the ability to retrospectively reproduce a result and to understand its origins for future use. Provenance also facilitates measurement of data's importance, that is, its impact. Guided by strong preliminary work, we will test our hypothesis by pursuing two specific aims: (1) building APIs for provenance, data management, data integrity, and re-executable workflows; and (2) providing a platform for storing and deploying containerized compute environments that also serves as a learning laboratory for reproducible data science. This approach is innovative in its focus on flexibility and on accommodating the myriad use cases across biomedical science, while providing a hub for training investigators in reproducible data science. By creating an open-source Flexible Research Data Service, the proposed research will significantly improve our ability to make our investments in biomedical research more useful.

Public Health Relevance

The proposed research is relevant to public health because it develops an array of services that help researchers produce more robust and reproducible data-intensive science. It is relevant to the NIH's mission of supporting the application of knowledge to reduce the burdens of human disability.

Agency
National Institutes of Health (NIH)
Institute
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
Type
Research Project--Cooperative Agreements (U01)
Project #
5U01EB020957-03
Application #
9270029
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Duan, Qi
Project Start
2015-06-01
Project End
2019-05-31
Budget Start
2017-06-01
Budget End
2019-05-31
Support Year
3
Fiscal Year
2017
Total Cost
Indirect Cost
Name
Duke University
Department
Biostatistics & Other Math Sci
Type
Schools of Medicine
DUNS #
044387793
City
Durham
State
NC
Country
United States
Zip Code
27705