The Center for Big Data in Translational Genomics will pioneer common Application Programming Interfaces (APIs) for big genomics data in biomedicine. This will involve multiple groups in academia, medicine and industry, and interactions with a recently formed global alliance for responsible sharing of genomic and clinical data. The Center will create reference implementations that will drive API adoption, and that by coordination with industry will be readily deployable in the broadest range of commercial clouds, including those of Amazon, Google and Microsoft, as well as within private clouds. Along with APIs that drive standards, the Center will create a continuously operating benchmarking platform for methods of large-scale genomics analysis for worldwide use. This will establish the best-of-breed methods, and force collective improvement across big data genomics. The APIs and benchmarking efforts will create a rich infrastructure for genomics software developers. To make these underlying computational methods available to the wider biomedical community, the Center will develop large-scale genomics analysis tools on top of the big genomics data APIs, including tools for read mapping, variant analysis, transcript analysis, pathway analysis, and interactive data visualization, allowing researchers to routinely tackle data sets orders of magnitude larger than is currently possible. To ensure that the APIs and tools are developed and adapted over the course of the project to address the current and continually growing needs of biomedicine, the Center will pilot the APIs and tools in the context of a variety of driving projects, including the UK10K project in the area of population genetics and disease association research, the ICGC Pan Cancer analysis of 2,000 tumours in the area of large-scale cancer genomics, the I-SPY2 Breast cancer trial in the area of clinical trials, and the BeatAML omics-guided leukemia project in the area of clinical practice. 
This set of projects collectively represents some two petabytes of raw data and encompasses a variety of uses of genomics in biomedicine, ensuring the software developed will be applicable to the broadest range of problems.

Public Health Relevance

At present, most genomics data is locked up in medical center silos, each individually developing their own data representations and analysis methods. Without cross-center data exchange standardization and collective benchmarking of computational procedures for accuracy and efficiency, medical genomics will become locked into inadequate and incompatible legacy approaches. An open, international, competitive, modern software development approach to data sharing, benchmarking and new computational tool development is needed, driven by leading biomedical projects to ensure it addresses the most pressing

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Specialized Center--Cooperative Agreements (U54)
Study Section: Special Emphasis Panel (ZRG1-BST-R (52))
Program Officer: Brooks, Lisa
University of California Santa Cruz
Santa Cruz
United States
Toor, Jugmohit S; Rao, Arjun A; McShan, Andrew C et al. (2018) A Recurrent Mutation in Anaplastic Lymphoma Kinase with Distinct Neoepitope Conformations. Front Immunol 9:99
Kronenberg, Zev N; Fiddes, Ian T; Gordon, David et al. (2018) High-resolution comparative analysis of great ape genomes. Science 360:
Jain, Miten; Olsen, Hugh E; Turner, Daniel J et al. (2018) Linear assembly of a human centromere on the Y chromosome. Nat Biotechnol 36:321-323
Garrison, Erik; Sirén, Jouni; Novak, Adam M et al. (2018) Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 36:875-879
Ellrott, Kyle; Bailey, Matthew H; Saksena, Gordon et al. (2018) Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst 6:271-281.e7
Fiddes, Ian T; Armstrong, Joel; Diekhans, Mark et al. (2018) Comparative Annotation Toolkit (CAT)-simultaneous clade and personal genome annotation. Genome Res 28:1029-1038
Paten, Benedict; Eizenga, Jordan M; Rosen, Yohei M et al. (2018) Superbubbles, Ultrabubbles, and Cacti. J Comput Biol 25:649-663
Tyson, John R; O'Neil, Nigel J; Jain, Miten et al. (2018) MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome. Genome Res 28:266-274
Jain, Miten; Koren, Sergey; Miga, Karen H et al. (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36:338-345
Computational Pan-Genomics Consortium (2018) Computational pan-genomics: status, promises and challenges. Brief Bioinform 19:118-135

Showing the most recent 10 out of 76 publications