Our team brings to the DCPPC decades of experience in building widely used software for managing, accessing, analyzing, and sharing large distributed data, and in enabling collaborative science. We bring deep expertise with software and services directly relevant to NIH Data Commons goals, including Globus cloud services for security and data management; DERIVA for data organization and navigation; and Jupyter and the Galaxy-based Globus Genomics for user-friendly analysis. In addition, we have a successful track record in creating and operating vertically integrated systems for specific biomedical communities, such as the FaceBase, RBK, and GUDMAP consortia and many genomics research groups. These projects span the Common Fund, NCI, NHGRI, NIMH, NIDCR, and NIDDK. Here we leverage this extensive experience and code base to propose a low-risk, high-capability solution to NIH Data Commons requirements. This solution is distinguished by its focus on user experience, end-to-end FAIRness, cloud platform independence, and modularity.
In the following, we first describe, in Section 1, the cross-cutting Minimal Viable Product (MVP) that we will develop within the first 180 days to demonstrate and evaluate the effectiveness of our approach. This MVP emphasizes our solution's innovation, showing how it allows researchers to seamlessly (and securely) access and query data from the three identified data sources (and others); access the entirety or subsets of those data (as well as other datasets) and move data among different locations (e.g., institutions, public and private cloud platforms, computing centers) for analysis or collaboration; track and verify data throughout the lifecycle, including provenance and attribution; and create, share, and analyze datasets on cloud platforms using reproducible and sharable scientific workflows. That we can deliver this sophisticated functionality in just 180 days underscores the power of our technical approach.
In Sections 2, 4, 5, 6, and 7, we describe in turn how we address important elements of Key Capabilities 2, 4, 5, 6, and 7, respectively. There we show how our methods address individual DCPPC concerns such as security, privacy, identifiers, and workflows. Importantly, each element of our solution is cloud-agnostic and is designed to interoperate with the other elements, other Commons services, related services such as ORCID and GitHub, and cloud services from public cloud providers. Each element has a well-defined and programmatically accessible REST API and software development kits (SDKs) that enable use by others. Each component has been applied on a substantial scale in biomedical settings, including three NIH data repositories and hundreds of research institutions across the US and internationally: see KC4.
We conclude with a note regarding FAIRness. The Data Commons aims to ensure that all data adhere to FAIR principles [14, 24]. To that end, we aim to ensure that FAIR guidelines apply to all data throughout their lifecycle. Data should be born FAIR and live FAIR, so as to promote reuse, collaboration, and reproducibility [11] at all scales, from laboratory and collaboration to research community, and across different cloud platforms, research computing centers, and other locations.

Agency: National Institutes of Health (NIH)
Institute: Office of The Director, National Institutes of Health (OD)
Project #: 3OT3OD025458-01S1
Application #: 9672005
Study Section: Data Coordination, Mapping, and Modeling (DCMM)
Program Officer: Kutkat, Lora
Project Start: 2017-09-30
Project End: 2018-11-30
Budget Start: 2017-09-30
Budget End: 2018-11-30
Support Year: 1
Fiscal Year: 2018
Total Cost:
Indirect Cost:
Name: University of Chicago
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 005421136
City: Chicago
State: IL
Country: United States
Zip Code: 60637