In our initial LOI response to the NIH Research Opportunity Announcement (ROA) RM-17-026, we identified four Key Capability areas where we can make meaningful contributions to creating an NIH Data Commons Platform (DCP): the Scientific Use Case (KC#8), Workspaces for Computation (KC#5), Open Standard APIs (KC#3), and Cloud Agnostic Architectures and Frameworks (KC#4). In our LOI, we summarized how we could contribute well-tested, open source component technologies to the DCP effort and bring our strong record of national and international collaboration with other researchers to the NIH Data Commons Pilot Phase Consortium (DCPPC). Here we address in detail the 180 days of work we would undertake in Stage 1 to deliver a Minimum Viable Product (MVP) (Figure 1). The MVP will demonstrate the key functionality for joining and analyzing data from the TOPMed, GTEx, and MODs databases through a common, standards-based API and analytic tools, applied to a scientifically important use case. For our specific Scientific Use Case (Key Capability #8) deliverable, we plan to analyze TOPMed, GTEx, and MODs data to answer an urgent scientific question in the precision medicine era: where do ethnic differences in single and multimodal sets of biological and clinical variables allow us to better understand pathobiology and their impact on diagnostic and therapeutic planning for these diverse populations? The domain expertise needed to answer this question in a reproducible manner includes our investigators' ability to operate collaboratively and efficiently using our infrastructure for sharing analysis notebooks, exploring data with interactive visual tools, and accessing sensitive, private data across multi-vendor cloud-hosted sites. Key Capabilities #3, #4, and #5 provide the infrastructure for tackling this pathobiology question.
Specifically, our API Capability (#3) includes the BD2K PIC-SURE API, the Sync for Science project, and the UDN Fileservice API, which allow us to combine multiple data modalities (e.g., genomic, clinical, environmental, and social web) using a standard web protocol. These APIs can be run, and reproducibly rerun, from multiple languages within the Jupyter notebooks that are part of our Workspaces for Computation Capability (#5). Researchers will be authenticated and authorized via standardized OAuth 2.0 protocols (#3 and #4) across multiple HIPAA- and FISMA-compliant hosted server instances at multiple cloud vendors, including AWS, Azure, Google, and IBM, that are part of our Cloud Agnostic Architecture and Frameworks Capability (#4). We recognize that NIH will work with a range of vendors for risk reduction and resilience, just as we have, and accordingly we will architect all of our contributions to the NIH Data Commons to be vendor-agnostic. Prior and current projects have demonstrated our infrastructure capabilities in national-scale implementations, including the Undiagnosed Diseases Network, PCORI, and the ACT Network. During the 180-day planning phase (Stage 1), we will pay close attention to what the other DCPPC partners are creating for their respective MVPs to see how we might best contribute to a robust NIH Data Commons Platform, including opportunistic collaborations where DCPPC deliverables overlap with our own MVP deliverables.

National Institutes of Health (NIH)
Office of The Director, National Institutes of Health (OD)
Study Section: Data Coordination, Mapping, and Modeling (DCMM)
Program Officer: Kutkat, Lora
Harvard Medical School
Schools of Medicine
United States
Diao, James A; Kohane, Isaac S; Manrai, Arjun K (2018) Biomedical informatics and machine learning for clinical genomics. Hum Mol Genet 27:R29-R34