The Center for Big Data in Translational Genomics will pioneer common Application Programming Interfaces (APIs) for big genomics data in biomedicine. This will involve multiple groups in academia, medicine and industry, and interactions with a recently formed global alliance for responsible sharing of genomic and clinical data. The Center will create reference implementations that will drive API adoption and that, through coordination with industry, will be readily deployable in the broadest range of commercial clouds, including those of Amazon, Google and Microsoft, as well as within private clouds. Along with APIs that drive standards, the Center will create a continuously operating benchmarking platform for methods of large-scale genomics analysis for worldwide use. This will establish best-of-breed methods and force collective improvement across big data genomics. The APIs and benchmarking efforts will create a rich infrastructure for genomics software developers. To make these underlying computational methods available to the wider biomedical community, the Center will develop large-scale genomics analysis tools on top of the big genomics data APIs, including tools for read mapping, variant analysis, transcript analysis, pathway analysis, and interactive data visualization, allowing researchers to routinely tackle data sets orders of magnitude larger than is currently possible. To ensure that the APIs and tools are developed and adapted over the course of the project to address the current and continually growing needs of biomedicine, the Center will pilot the APIs and tools in the context of a variety of driving projects, including the UK10K project in the area of population genetics and disease association research, the ICGC Pan Cancer analysis of 2,000 tumours in the area of large-scale cancer genomics, the I-SPY2 breast cancer trial in the area of clinical trials, and the BeatAML omics-guided leukemia project in the area of clinical practice.
This set of projects collectively represents some two petabytes of raw data and encompasses a variety of uses of genomics in biomedicine, ensuring the software developed will be applicable to the broadest range of problems.

Public Health Relevance

At present, most genomics data is locked up in medical center silos, each individually developing their own data representations and analysis methods. Without cross-center data exchange standardization and collective benchmarking of computational procedures for accuracy and efficiency, medical genomics will become locked into inadequate and incompatible legacy approaches. An open, international, competitive, modern software development approach to data sharing, benchmarking and new computational tool development is needed, driven by leading biomedical projects to ensure it addresses the most pressing

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Specialized Center--Cooperative Agreements (U54)
Study Section: Special Emphasis Panel (ZRG1-BST-R (52))
Program Officer: Brooks, Lisa
University of California Santa Cruz
Santa Cruz
United States
Kozanitis, Christos; Patterson, David A (2016) GenAp: a distributed SQL interface for genomic data. BMC Bioinformatics 17:63
Gordon, David; Huddleston, John; Chaisson, Mark J P et al. (2016) Long-read sequence assembly of the gorilla genome. Science 352:aae0344
Haeussler, Maximilian; Schönig, Kai; Eckert, Hélène et al. (2016) Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol 17:148
Speir, Matthew L; Zweig, Ann S; Rosenbloom, Kate R et al. (2016) The UCSC Genome Browser database: 2016 update. Nucleic Acids Res 44:D717-25
Jain, Miten; Olsen, Hugh E; Paten, Benedict et al. (2016) The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol 17:239
Yang, Shan; Cline, Melissa; Zhang, Can et al. (2016) Data sharing and reproducible clinical genetic testing: successes and challenges. Pac Symp Biocomput 22:166-176
Ip, Camilla L C; Loose, Matthew; Tyson, John R et al. (2015) MinION Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Res 4:1075
Novak, Adam M; Rosen, Yohei; Haussler, David et al. (2015) Canonical, stable, general mapping using context schemes. Bioinformatics 31:3569-76
Philippakis, Anthony A; Azzariti, Danielle R; Beltran, Sergi et al. (2015) The Matchmaker Exchange: a platform for rare disease gene discovery. Hum Mutat 36:915-21
Paten, Benedict; Diekhans, Mark; Druker, Brian J et al. (2015) The NIH BD2K center for big data in translational genomics. J Am Med Inform Assoc 22:1143-7
