High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The goal of this project is to design and develop innovative compression-based algorithmic techniques and publicly available software for large-scale genomic sequence data sets. The key underlying observation is that most genomes currently being sequenced share much similarity with genomes that have already been collected. Thus, the amount of new sequence information is growing much more slowly than the total size of genomic sequence data sets. In very recent work, we have provided a proof of concept that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data, a methodological paradigm we term "compressive genomics." In this proposal we broaden the framework of compressive genomics to several additional application areas in which algorithmic advances are urgently needed in order to keep pace with the growth in both genomic and protein sequencing data. In particular, we will build a novel comprehensive framework for compressive representation and highly efficient downstream analysis of large-scale next-generation sequencing (NGS) data sets; this will significantly advance the state of the art and scale better than existing algorithms as the volume of genomic data grows, thus meeting the challenge of the expected future acceleration of sequencing technologies. Additionally, we will develop advanced, compressively accelerated algorithms and software for specific applications of current interest in bioinformatics and apply them to real large-scale 'omics' data sets to accelerate data analytics and lead to novel biological discoveries.
Namely, we will collaborate with the Kohane lab on analysis of high-throughput gene expression and NGS data sets from patients with neurodevelopmental disorders, including Autism Spectrum Disorder and Parkinson's; the broad, long-term goal is to apply our compressive approach to such massive data sets to elucidate the still obscure molecular landscape of these diseases.

Understanding massive 'omics' data from patients will empower both rational, targeted drug design and more intelligent disease management, yet their sheer enormity threatens to make the arising problems computationally infeasible. Here, we develop computational methods and tools that will fundamentally advance the state of the art in storage, retrieval and analysis of these rapidly expanding data sets.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-BST-N (52))
Program Officer
Wu, Mary Ann
Massachusetts Institute of Technology
United States
McPherson, Andrew W; Roth, Andrew; Ha, Gavin et al. (2017) ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 18:140
Simmons, Sean; Sahinalp, Cenk; Berger, Bonnie (2016) Enabling Privacy-Preserving GWASs in Heterogeneous Human Populations. Cell Syst 3:54-61
Shajii, Ariya; Yorukoglu, Deniz; Yu, Yun William et al. (2016) Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics 32:i538-i544
Yorukoglu, Deniz; Yu, Yun William; Peng, Jian et al. (2016) Compressive mapping for next-generation sequencing. Nat Biotechnol 34:374-6
Simmons, Sean; Berger, Bonnie (2016) Realizing privacy preserving genome-wide association studies. Bioinformatics 32:1293-300
Luo, Yunan; Zeng, Jianyang; Berger, Bonnie et al. (2016) Low-Density Locality-Sensitive Hashing Boosts Metagenomic Binning. Res Comput Mol Biol 9649:255-257
Berger, Bonnie; Daniels, Noah M; Yu, Y William (2016) Computational Biology in the 21st Century: Scaling with Compressive Algorithms. Commun ACM 59:72-80
Alberti, Claudio; Daniels, Noah; Hernaez, Mikel et al. (2016) An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values. Proc Data Compress Conf 2016:221-230
Simmons, Sean; Berger, Bonnie (2015) One Size Doesn't Fit All: Measuring Individual Privacy in Aggregate Genomic Data. Proc IEEE Symp Secur Priv Workshops 2015:41-49
Yu, Y William; Daniels, Noah M; Danko, David Christian et al. (2015) Entropy-scaling search of massive biological data. Cell Syst 1:130-140

Showing the most recent 10 out of 23 publications