The life sciences are in the midst of a data revolution. Cheap and accurate genome sequencing is a reality, high-resolution imaging is becoming routine, and clinical data is increasingly stored in machine-readable formats. These breakthroughs have brought us to the threshold of a new era in biomedicine, one where the data sciences hold the potential to propel our understanding and treatment of human disease. Achieving this potential, however, will require creating software platforms that can support storing, sharing, and analyzing data at unlimited scale. In this application, we propose to address this unmet need by bringing together three groups (the University of Chicago, the Broad Institute, and the University of California at Santa Cruz), each with a strong track record of developing production-grade software platforms to support flagship scientific efforts, including the All of Us Cohort Program, the Genomic Data Commons (GDC) and its affiliated NCI Cloud Pilots program, and the Human Cell Atlas Data Coordination Platform (HCA DCP). Our goal is to align and integrate our individual efforts at building data platforms, in order to build a cohesive environment that can serve the needs of the NIH Data Commons and beyond. Because these platforms were each developed to fulfill differing use cases, there is currently far more complementarity than overlap between them. For example, Dr. Grossman has extensive expertise in running a hybrid cloud at scale to support the needs of the GDC; this provides cost benefits around data transport and egress that would be invaluable to the NIH Data Commons. Similarly, Dr. Philippakis has developed a cloud-based model of collaborative workspaces (FireCloud) and software for management of secondary data use restrictions (DUOS), and Dr. Paten has long been a leader in developing and implementing standardized APIs as part of the GA4GH. It is this complementarity that motivates us to integrate our efforts.
In the sections below, we present our plans for creating the Commons Alliance Platform. In addition to having a unified technical vision for what is needed, we are also aligned around a core set of guiding principles:
(1) Open-source: All the software we develop, from user interfaces down to cloud metal, is open-source. This includes not only the software that would be funded via this awarding mechanism, but all software developed and deployed by our team.
(2) Modular and interoperable: A design principle of all complex software undertakings is separation of concerns, i.e. the notion that there should be a clean division between architectural components, each encapsulated by well-defined interfaces. We are committed to building modular and interoperable software and, in doing so, encouraging the creation of an ecosystem around it.
(3) Standards-driven: Our team is committed to creating and utilizing standardized APIs and data formats. We have been leaders in GA4GH since its founding, chairing various working groups and driver projects.
(4) Healthy competition: Our consortium's philosophy is to collaborate on APIs to support interoperability, but compete on implementation to encourage creativity and diversity.
(5) Diversity of data types: We have expertise in multiple data types beyond molecular profiling. In particular, a key goal of All of Us is to collect extensive clinical data in the form of participant-provided data and medical records. Similarly, through the Brain Health Commons, Dr. Grossman will be managing clinical and imaging data. These capabilities will be invaluable as the Commons expands to include additional data types.
(6) Driven by scientific use cases: Our consortium includes many leading scientists, including PIs on awards for model organism databases, GTEx, and TOPMed. We will leverage their insights via driving use cases to ensure that our software enables flagship scientific investigations.

Agency: National Institutes of Health (NIH)
Institute: National Heart, Lung, and Blood Institute (NHLBI)
Project #: 3OT3HL142481-01S2
Application #: 10001102
Study Section: Data Coordination, Mapping, and Modeling (DCMM)
Program Officer: Kaltman, Jonathan R
Project Start: 2017-09-30
Project End: 2020-09-28
Budget Start: 2019-09-27
Budget End: 2020-09-28
Support Year: 1
Fiscal Year: 2019
Name: University of California Santa Cruz
DUNS #: 125084723
City: Santa Cruz
State: CA
Country: United States
Zip Code: 95064
Grossman, Robert L (2018) Progress Toward Cancer Data Ecosystems. Cancer J 24:126-130