High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer scale threatens to make their interpretation computationally infeasible. The continued goal of this project is to design and develop innovative compression-based algorithmic techniques for efficiently processing massive biological data. We will branch out beyond compressive search to address the imminent need to securely store and process large-scale genomic data in the cloud, as well as to gain insights from massive metagenomic data. The key underlying observation is that genomic data is highly structured, exhibiting a high degree of self-similarity. In the previous grant period, we exploited this high redundancy and low fractal dimension to enable scalable compressive storage and accelerated search of sequence data, as well as of other biological data types relevant to structural bioinformatics and chemogenomics. In this renewal, we will continue to capitalize on the structure (i.e., compressibility) of genomic data to: (i) overcome the privacy concerns that arise in sharing sensitive human data (e.g., in the cloud); (ii) address new challenges, beyond search, posed by metagenomic data; and (iii) widen the adoption of our existing and newly proposed compressive algorithms in industry, research, and clinical settings. We will demonstrate the utility of our compressive techniques in characterizing human genomic and metagenomic variation. We will collaborate with co-I Sahinalp's lab (Indiana University, Bloomington) on developing and applying these tools to high-throughput data sets, including autism spectrum disorder (with Isaac Kohane and Evan Eichler), cancer (with PCAWG, the Pan-Cancer Analysis of Whole Genomes), the microbiome (with Eric Alm and Jian Peng), and human variation analysis (GATK, with Eric Lander and Eric Banks).
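The compressive-acceleration observation above — that redundant, self-similar sequence collections can be searched in time proportional to their compressed rather than raw size — can be illustrated with a minimal toy sketch. This is not the project's actual implementation; the function names, the Hamming-distance metric, and the clustering radius are all illustrative assumptions:

```python
def hamming(a, b):
    """Mismatch count between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def build_index(seqs, radius=2):
    """Greedily cluster near-duplicate sequences: each sequence joins the
    first representative within `radius` mismatches, else it becomes a new
    representative. Search then only needs to scan representatives."""
    clusters = []  # list of (representative, members)
    for s in seqs:
        for rep, members in clusters:
            if len(s) == len(rep) and hamming(s, rep) <= radius:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters

def compressive_search(clusters, query, max_dist=1, radius=2):
    """Coarse pass over representatives, fine pass only inside promising
    clusters. By the triangle inequality, any member within `max_dist` of
    the query has a representative within `max_dist + radius`, so the
    coarse filter loses no true hits."""
    hits = []
    for rep, members in clusters:
        if hamming(query, rep) <= max_dist + radius:  # coarse filter
            hits.extend(m for m in members if hamming(query, m) <= max_dist)
    return hits
```

The coarse-then-fine pattern — searching a nonredundant set of representatives first and descending into similar sequences only on a hit — is the essence of compressive acceleration; real implementations use edit-script links and full sequence-database indexes rather than exact Hamming clusters.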
The broad, long-term goal is to apply our compressive approach to massive biological data sets to elucidate the still-obscure molecular landscape of disease. Successful completion of these aims will yield computational methods and tools that significantly increase our ability to securely store, access, and analyze massive data sets, reveal fundamental aspects of genetic variation, and generate testable hypotheses for experimental investigation. Not only will all developed software be made publicly available, but, as part of our integration aim, we will also ensure that the research community can adopt our innovations with minimal effort. Through our research collaborations, we will both build these tools and demonstrate their relevance to the characterization of human health and disease.

Public Health Relevance

Understanding massive genomic data from patients will enable both the development of microbiome therapeutics and new insights into human disease variation, yet this task poses major scalability and privacy challenges. Here, we develop novel computational methods and tools that fundamentally advance the state of the art in the efficient and secure storage, access, and analysis of these rapidly expanding data sets.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Project #
Application #
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Massachusetts Institute of Technology
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Zip Code
Cho, Hyunghoon; Wu, David J; Berger, Bonnie (2018) Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36:547-551
Numanagić, Ibrahim; Malikić, Salem; Ford, Michael et al. (2018) Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. Nat Commun 9:828
Shajii, Ariya; Numanagić, Ibrahim; Berger, Bonnie (2018) Latent Variable Model for Aligning Barcoded Short-Reads Improves Downstream Analyses. Res Comput Mol Biol 10812:280-282
Shajii, Ariya; Numanagić, Ibrahim; Whelan, Christopher et al. (2018) Statistical Binning for Barcoded Reads Improves Downstream Analyses. Cell Syst 7:219-226.e5
Lin, Yen-Yi; Gawronski, Alexander; Hach, Faraz et al. (2018) Computational identification of micro-structural variations and their proteogenomic consequences in cancer. Bioinformatics 34:1672-1681
Ginart, Antonio A; Hui, Joseph; Zhu, Kaiyuan et al. (2018) Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 9:566
Kalina, Jennifer L; Neilson, David S; Lin, Yen-Yi et al. (2017) Mutational Analysis of Gene Fusions Predicts Novel MHC Class I-Restricted T-Cell Epitopes and Immune Signatures in a Subset of Prostate Cancer. Clin Cancer Res 23:7596-7607
McPherson, Andrew W; Roth, Andrew; Ha, Gavin et al. (2017) ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 18:140
Luo, Yunan; Zeng, Jianyang; Berger, Bonnie et al. (2016) Low-Density Locality-Sensitive Hashing Boosts Metagenomic Binning. Res Comput Mol Biol 9649:255-257
Simmons, Sean; Berger, Bonnie (2016) Realizing privacy preserving genome-wide association studies. Bioinformatics 32:1293-1300

Showing the most recent 10 out of 30 publications