High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The continued goal of this project is to design and develop innovative compression-based algorithmic techniques for efficiently processing massive biological data. We will branch out beyond compressive search to address the imminent need to securely store and process large-scale genomic data in the cloud, as well as to gain insights from massive metagenomic data. The key underlying observation is that genomic data are highly structured, exhibiting high degrees of self-similarity. In our previous granting period, we exploited this high redundancy and low fractal dimension to enable scalable compressive storage and accelerated search of sequence data, as well as of other biological data types relevant to structural bioinformatics and chemogenomics. In this renewal, we will continue to capitalize on the structure (i.e., compressibility) of genomic data to: (i) overcome privacy concerns that arise in sharing sensitive human data (e.g., in the cloud); (ii) address new challenges, beyond search, with metagenomic data; and (iii) widen the adoption of the previous and newly proposed compressive algorithms for industry, research, and clinical use. We will demonstrate the utility of our compressive techniques for characterizing human genomic and metagenomic variation. We will collaborate with co-I Sahinalp's lab (Indiana University, Bloomington) on developing and applying these tools to high-throughput data sets on autism spectrum disorder (with Isaac Kohane and Evan Eichler), cancer (with PCAWG, the Pan-Cancer Analysis of Whole Genomes), and the microbiome (with Eric Alm and Jian Peng), as well as to human variation analysis (GATK, with Eric Lander and Eric Banks).
The broad, long-term goal is to apply our compressive approach to massive biological data sets to elucidate the still-obscure molecular landscape of diseases. Successful completion of these aims will yield computational methods and tools that significantly increase our ability to securely store, access, and analyze massive data sets, and will reveal fundamental aspects of genetic variation as well as testable hypotheses for experimental investigations. All developed software will be made publicly available, and as part of our integration aim, we will ensure that the research community can make use of our innovations with minimal effort. Through our research collaborations, we will both build these tools and demonstrate their relevance to the characterization of human health and disease.

Public Health Relevance

Understanding massive genomic data from patients will empower both the development of microbiome therapeutics and insights into human disease variation, yet this task brings major scalability and privacy challenges. Here, we develop novel computational methods and tools that will fundamentally advance the state of the art in efficient and secure storage, access, and analysis of these rapidly expanding data sets.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Ravichandran, Veerasamy
Massachusetts Institute of Technology
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
McPherson, Andrew W; Roth, Andrew; Ha, Gavin et al. (2017) ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 18:140
Simmons, Sean; Sahinalp, Cenk; Berger, Bonnie (2016) Enabling Privacy-Preserving GWASs in Heterogeneous Human Populations. Cell Syst 3:54-61
Shajii, Ariya; Yorukoglu, Deniz; Yu, Yun William et al. (2016) Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics 32:i538-i544
Yorukoglu, Deniz; Yu, Yun William; Peng, Jian et al. (2016) Compressive mapping for next-generation sequencing. Nat Biotechnol 34:374-6
Simmons, Sean; Berger, Bonnie (2016) Realizing privacy preserving genome-wide association studies. Bioinformatics 32:1293-300
Luo, Yunan; Zeng, Jianyang; Berger, Bonnie et al. (2016) Low-Density Locality-Sensitive Hashing Boosts Metagenomic Binning. Res Comput Mol Biol 9649:255-257
Berger, Bonnie; Daniels, Noah M; Yu, Y William (2016) Computational Biology in the 21st Century: Scaling with Compressive Algorithms. Commun ACM 59:72-80
Alberti, Claudio; Daniels, Noah; Hernaez, Mikel et al. (2016) An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values. Proc Data Compress Conf 2016:221-230
Simmons, Sean; Berger, Bonnie (2015) One Size Doesn't Fit All: Measuring Individual Privacy in Aggregate Genomic Data. Proc IEEE Symp Secur Priv Workshops 2015:41-49
Yu, Y William; Daniels, Noah M; Danko, David Christian et al. (2015) Entropy-scaling search of massive biological data. Cell Syst 1:130-140
