The Center for Big Data in Translational Genomics will pioneer common Application Programming Interfaces (APIs) for big genomics data in biomedicine. This will involve multiple groups in academia, medicine and industry, and interactions with a recently formed global alliance for responsible sharing of genomic and clinical data. The Center will create reference implementations that will drive API adoption and that, through coordination with industry, will be readily deployable in the broadest range of commercial clouds, including those of Amazon, Google and Microsoft, as well as within private clouds. Along with APIs that drive standards, the Center will create a continuously operating benchmarking platform, available for worldwide use, for methods of large-scale genomics analysis. This will establish best-of-breed methods and force collective improvement across big data genomics. The APIs and benchmarking efforts will create a rich infrastructure for genomics software developers. To make these underlying computational methods available to the wider biomedical community, the Center will develop large-scale genomics analysis tools on top of the big genomics data APIs, including tools for read mapping, variant analysis, transcript analysis, pathway analysis, and interactive data visualization, allowing researchers to routinely tackle data sets orders of magnitude larger than is currently possible. To ensure that the APIs and tools are developed and adapted over the course of the project to address the current and continually growing needs of biomedicine, the Center will pilot the APIs and tools in the context of a variety of driving projects, including the UK10K project in the area of population genetics and disease association research, the ICGC Pan Cancer analysis of 2,000 tumours in the area of large-scale cancer genomics, the I-SPY2 breast cancer trial in the area of clinical trials, and the BeatAML omics-guided leukemia project in the area of clinical practice.
This set of projects collectively represents some two petabytes of raw data and encompasses a variety of uses of genomics in biomedicine, ensuring the software developed will be applicable to the broadest range of problems.
At present, most genomics data is locked up in silos at individual medical centers, each of which develops its own data representations and analysis methods. Without standardized cross-center data exchange and collective benchmarking of computational procedures for accuracy and efficiency, medical genomics will become locked into inadequate and incompatible legacy approaches. An open, international, competitive, modern software development approach to data sharing, benchmarking and new computational tool development is needed, driven by leading biomedical projects to ensure it addresses the most pressing