Data provenance is key to ensuring data quality, supporting scientific reproducibility, and tracing the lineage of data as it undergoes transformation for use in the data-driven research paradigm. The emerging Big Data resources in the biomedical research and clinical care domains have highlighted multiple computational challenges in developing a scalable, high-performance provenance analysis engine. These challenges include semantic heterogeneity across provenance information generated from disparate sources (variety) and the lack of scalable provenance analysis algorithms that can keep pace with the large volume of data generated at a rapid velocity. Using the new PROV representation standard recommended by the W3C, the standards body for Web technologies, together with distributed cloud computing technologies, we propose to develop a highly scalable, data-source-agnostic provenance engine. To address the lack of appropriate provenance analysis operations required to develop this engine over the PROV representation model, we will follow a three-phase approach: (1) we will first develop a new algebraic graph framework for analyzing provenance graphs that conform to the PROV standard; (2) in the second phase, we will use the insights from this systematic characterization of provenance analysis operations to define distributed algorithms for implementation over cloud computing technologies; and (3) in the final phase, we will implement the provenance engine, which will support three fundamental provenance functions: (a) scientific reproducibility, (b) data quality assurance, and (c) trust computation. The resulting provenance engine will potentially transform the use of provenance in biomedical Big Data exploration and analysis across the growing number of data repositories, such as the National Sleep Research Resource, to accelerate data-driven research in disease mechanisms.
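
As a rough illustration of the kind of provenance graph operation described above, the following minimal Python sketch traces the lineage of a data item over a small graph whose edges use three PROV relation names (wasGeneratedBy, used, wasDerivedFrom). It is not the project's algebraic framework or engine: the file names, activity names, and the helper functions `build_graph` and `lineage` are hypothetical, and a plain in-memory breadth-first traversal stands in for the distributed algorithms the project proposes.

```python
from collections import defaultdict, deque

# Hypothetical provenance statements as (subject, relation, object) triples.
# Each edge points from a later item to the earlier item it depends on,
# which matches the direction of the PROV relations named here.
EDGES = [
    ("sleep_summary.csv", "wasGeneratedBy", "compute_summary"),
    ("compute_summary", "used", "cleaned_signals.edf"),
    ("cleaned_signals.edf", "wasGeneratedBy", "artifact_removal"),
    ("artifact_removal", "used", "raw_signals.edf"),
    ("sleep_summary.csv", "wasDerivedFrom", "cleaned_signals.edf"),
]


def build_graph(edges):
    """Index the triples by subject so outgoing edges can be looked up."""
    graph = defaultdict(list)
    for subject, relation, obj in edges:
        graph[subject].append((relation, obj))
    return graph


def lineage(graph, start):
    """Return every entity or activity that `start` depends on,
    directly or transitively, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for _relation, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order


graph = build_graph(EDGES)
print(lineage(graph, "sleep_summary.csv"))
```

A lineage query of this kind underlies all three functions named in the abstract: reproducing a result requires knowing which activities and inputs produced it, and quality or trust assessments depend on the sources reached by the traversal.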

Public Health Relevance

Data provenance is key to ensuring data quality, supporting scientific reproducibility, and tracing the lineage of data for use in the 'data-driven' research paradigm. The goal of this project is to use the PROV provenance model together with distributed cloud computing technologies to develop a data-source-agnostic provenance engine. Results from this project will enable the acceleration of research in disease mechanisms through biomedical 'Big Data' analytics.

Agency: National Institutes of Health (NIH)
Institute: National Institute of Biomedical Imaging and Bioengineering (NIBIB)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01EB020955-03
Application #: 9275507
Study Section: Special Emphasis Panel (ZRG1-BST-N (50)R)
Program Officer: Ramos, Edward
Project Start: 2015-06-01
Project End: 2018-05-31
Budget Start: 2017-06-01
Budget End: 2018-05-31
Support Year: 3
Fiscal Year: 2017
Total Cost: $289,813
Indirect Cost: $89,950

Name: Case Western Reserve University
Department: Internal Medicine/Medicine
Type: Schools of Medicine
DUNS #: 077758407
City: Cleveland
State: OH
Country: United States
Zip Code: 44106

Valdez, Joshua; Kim, Matthew; Rueschman, Michael et al. (2017) ProvCaRe Semantic Provenance Knowledgebase: Evaluating Scientific Reproducibility of Research Studies. AMIA Annu Symp Proc 2017:1705-1714
Sahoo, Satya S; Valdez, Joshua; Rueschman, Michael (2016) Scientific Reproducibility in Biomedical Research: Provenance Metadata Ontology for Semantic Annotation of Study Description. AMIA Annu Symp Proc 2016:1070-1079
Yang, Sheng; Tatsuoka, Curtis; Ghosh, Kaushik et al. (2016) Comparative Evaluation for Brain Structural Connectivity Approaches: Towards Integrative Neuroinformatics Tool for Epilepsy Clinical Research. AMIA Jt Summits Transl Sci Proc 2016:446-454
Sahoo, Satya S; Ramesh, Priya; Welter, Elisabeth et al. (2016) Insight: An ontology-based integrated database and analysis platform for epilepsy self-management research. Int J Med Inform 94:21-30
Valdez, Joshua; Rueschman, Michael; Kim, Matthew et al. (2016) An Ontology-Enabled Natural Language Processing Pipeline for Provenance Metadata Extraction from Biomedical Text (Short Paper). On Move Meaningful Internet Syst 10033:699-708
Ramesh, Priya; Wei, Annan; Welter, Elisabeth et al. (2015) Insight: Semantic Provenance and Analysis Platform for Multi-center Neurology Healthcare Research. Proceedings (IEEE Int Conf Bioinformatics Biomed) 2015:731-736