Across all sectors of the Internet, security, regulatory and privacy concerns, coupled with bandwidth limitations, are shifting Big Data technologies away from data transfer and toward algorithms that analyze data in situ. Biomedicine can take advantage of general technologies like Docker that have evolved to meet the needs of this shift. Docker, which allows a new level of lightweight portability for computer code, has already significantly penetrated bioinformatics. No recent project illustrates this better than the large, international Pan-Cancer Analysis of Whole Genomes (PCAWG, https://dcc.icgc.org/pcawg) collaboration. This effort saw the creation of common analytical pipelines that were uniformly applied to the whole genome sequences of over 2,800 cancer donors in 14 disparate HPC and cloud computing environments, making extensive use of Docker container technology. Critical to this effort was a rethink of the way algorithms were developed, packaged, and moved from environment to environment. The net outcome was the creation of the Dockstore project (http://dockstore.org). Dockstore facilitates the sharing and mobility of biomolecular analysis tools and workflows. It allows bioinformaticians to bring together individual tools and entire workflows packaged in portable Docker images (containers), described using either the Common Workflow Language (CWL) or the Workflow Description Language (WDL). In this way, Dockstore standardizes computational analyses, making them precisely reproducible and runnable in any environment that supports Docker. The work proposed here supports the extended development, hardening, content addition, cloud integration and dissemination of the Dockstore to the wider biomedical research community.
Most importantly, it supports full federation, under the auspices of the Global Alliance for Genomics and Health (GA4GH), of the original Dockstore with other similar projects worldwide through an API that makes it possible to search for containers and workflows across a global network. The federated network will allow groups and individual projects to create not only individual analyses, but entire analysis repositories that are institutionally branded and shared with the rest of the world under a common GA4GH index and set of interoperability standards. The result will be an integrated network providing portable, securely signed, easily deployed workflows and tools covering the spectrum of biomedical analyses. It will make finding, testing and applying these analyses to new data far less time-consuming and error-prone, and reduce redundant reimplementation of key bioinformatic tasks. In contrast to the approach taken previously by influential efforts like Galaxy, which resulted in pushbutton methods that proved hard to scale to large datasets, the focus on portable, scalable workflow standards that can be used across a variety of platforms makes this the right basis for a broad biomedical analysis commons.

Public Health Relevance

Modern biomedical datasets are huge, many on the terabyte-to-petabyte scale, and moving data around for analysis is an increasingly intractable problem. Here we describe a platform, the Dockstore, that enables scientists to package their analyses portably, reproducibly, scalably and securely and send them to the data rather than moving large datasets around. This project will place Dockstore in a federated network of globally distributed, searchable repositories that makes it much easier for researchers to find, share and use biomedical software and, as a result, will have a significant impact on the broader biomedical research community.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG009737-02
Application #: 9552258
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Sofia, Heidi J
Project Start: 2017-08-29
Project End: 2021-06-30
Budget Start: 2018-07-01
Budget End: 2019-06-30
Support Year: 2
Fiscal Year: 2018
Total Cost:
Indirect Cost:

Name: University of California Santa Cruz
Department: Engineering (All Types)
Type: Biomed Engr/Col Engr/Engr Sta
DUNS #: 125084723
City: Santa Cruz
State: CA
Country: United States
Zip Code: 95064