Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools

Berger, Bonnie

Abstract

High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The continued goal of this project is to design and develop innovative compression-based algorithmic techniques for efficiently processing massive biological data. We will branch out beyond compressive search to address the imminent need to securely store and process large-scale genomic data in the cloud, as well as to gain insights from massive metagenomic data. The key underlying observation is that genomic data are highly structured, exhibiting high degrees of self-similarity. In our previous granting period, we exploited their high redundancy and low fractal dimension to enable scalable compressive storage and accelerated search of sequence data as well as other biological data types relevant to structural bioinformatics and chemogenomics. In this renewal, we will continue to capitalize on the structure (i.e., compressibility) of genomic data to: (i) overcome privacy concerns that arise in sharing sensitive human data (e.g., in the cloud); (ii) address new challenges, beyond search, with metagenomic data; and (iii) widen the adoption of the previous and newly proposed compressive algorithms for industry, research, and clinical use. We will demonstrate the utility of our compressive techniques for the characterization of human genomic and metagenomic variation. We will collaborate with co-I Sahinalp's lab (Indiana University, Bloomington) on developing and applying these tools to high-throughput data sets for autism spectrum disorder (with Isaac Kohane and Evan Eichler), cancer (with PCAWG, Pan Cancer Analysis of Whole Genomes), the microbiome (with Eric Alm and Jian Peng), and human variation analysis (GATK, with Eric Lander and Eric Banks).
The broad, long-term goal is to apply our compressive approach to massive biological data sets to elucidate the still obscure molecular landscape of diseases. Successful completion of these aims will result in computational methods and tools that will significantly increase our ability to securely store, access and analyze massive data sets and will reveal fundamental aspects of genetic variation, as well as testable hypotheses for experimental investigations. Not only will all developed software be made publicly available, but as part of our integration aim, we will also ensure that the research community can make use of our innovations with minimal effort. Through our research collaborations, we will both build these tools and demonstrate their relevance to the characterization of human health and disease.

Public Health Relevance

Understanding massive genomic data from patients will empower both the development of microbiome therapeutics and insights into human disease variation, yet this task brings major scalability and privacy challenges. Here, we develop novel computational methods and tools that will fundamentally advance the state of the art in efficient and secure storage, access, and analysis of these rapidly expanding data sets.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM108348-05
Application #: 9354503
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Ravichandran, Veerasamy

Project Start: 2013-09-05
Project End: 2020-08-31
Budget Start: 2017-09-01
Budget End: 2018-08-31
Support Year: 5
Fiscal Year: 2017
Total Cost
Indirect Cost

Institution

Name: Massachusetts Institute of Technology
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 001425594

City: Cambridge
State: MA
Country: United States
Zip Code: 02142

Related projects


NIH 2019 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools	Berger, Bonnie / Massachusetts Institute of Technology
NIH 2018 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools	Berger, Bonnie / Massachusetts Institute of Technology
NIH 2017 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools	Berger, Bonnie / Massachusetts Institute of Technology
NIH 2016 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools	Berger, Bonnie / Massachusetts Institute of Technology	$372,014
NIH 2015 R01 GM	Compressive genomics for large omics data sets: Algorithms applications & tools	Berger, Bonnie / Massachusetts Institute of Technology
NIH 2014 R01 GM	Compressive genomics for large omics data sets: Algorithms applications & tools	Berger, Bonnie / Massachusetts Institute of Technology	$213,200
NIH 2013 R01 GM	Compressive genomics for large omics data sets: Algorithms applications & tools	Berger, Bonnie / Massachusetts Institute of Technology	$217,893

Publications

Cho, Hyunghoon; Wu, David J; Berger, Bonnie (2018) Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36:547-551

Numanagić, Ibrahim; Malikić, Salem; Ford, Michael et al. (2018) Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. Nat Commun 9:828

Shajii, Ariya; Numanagić, Ibrahim; Berger, Bonnie (2018) Latent Variable Model for Aligning Barcoded Short-Reads Improves Downstream Analyses. Res Comput Mol Biol 10812:280-282

Shajii, Ariya; Numanagić, Ibrahim; Whelan, Christopher et al. (2018) Statistical Binning for Barcoded Reads Improves Downstream Analyses. Cell Syst 7:219-226.e5

Lin, Yen-Yi; Gawronski, Alexander; Hach, Faraz et al. (2018) Computational identification of micro-structural variations and their proteogenomic consequences in cancer. Bioinformatics 34:1672-1681

Ginart, Antonio A; Hui, Joseph; Zhu, Kaiyuan et al. (2018) Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 9:566

Kalina, Jennifer L; Neilson, David S; Lin, Yen-Yi et al. (2017) Mutational Analysis of Gene Fusions Predicts Novel MHC Class I-Restricted T-Cell Epitopes and Immune Signatures in a Subset of Prostate Cancer. Clin Cancer Res 23:7596-7607

McPherson, Andrew W; Roth, Andrew; Ha, Gavin et al. (2017) ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 18:140

Shajii, Ariya; Yorukoglu, Deniz; Yu, Yun William et al. (2016) Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics 32:i538-i544

Yorukoglu, Deniz; Yu, Yun William; Peng, Jian et al. (2016) Compressive mapping for next-generation sequencing. Nat Biotechnol 34:374-376

Showing the most recent 10 out of 30 publications
