High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The goal of this project is to design and develop innovative compression-based algorithmic techniques and publicly available software for large-scale genomic sequence data sets. The key underlying observation is that most genomes currently being sequenced share much similarity with genomes that have already been collected. Thus, the amount of new sequence information is growing much more slowly than the total size of genomic sequence data sets. In very recent work, we have provided a proof of concept that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data, a methodological paradigm we term compressive genomics. In this proposal we broaden the framework of compressive genomics to several additional application areas in which algorithmic advances are urgently needed in order to keep pace with the growth in both genomic and protein sequencing data. In particular, we will build a novel comprehensive framework for compressive representation and highly efficient downstream analysis of large-scale next-generation sequencing (NGS) data sets; this will significantly advance the state of the art and scale better than existing algorithms as the volume of genomic data grows, thus meeting the challenge of the expected future acceleration of sequencing technologies. Additionally, we will develop advanced, compressively accelerated algorithms and software for specific applications of current interest in bioinformatics and apply them to real large-scale 'omics' data sets to accelerate data analytics and lead to novel biological discoveries.
Specifically, we will collaborate with the Kohane lab on analysis of high-throughput gene expression and NGS data sets from patients with neurodevelopmental disorders, including Autism Spectrum Disorder and Parkinson's disease; the broad, long-term goal is to apply our compressive approach to such massive data sets to elucidate the still obscure molecular landscape of these diseases. Understanding massive 'omics' data from patients will empower both rational, targeted drug design and more intelligent disease management, yet their sheer enormity threatens to make the arising problems computationally infeasible. Here, we develop computational methods and tools that will fundamentally advance the state of the art in storage, retrieval and analysis of these rapidly expanding data sets.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Ravichandran, Veerasamy
Massachusetts Institute of Technology
United States
Shajii, Ariya; Numanagić, Ibrahim; Berger, Bonnie (2018) Latent Variable Model for Aligning Barcoded Short-Reads Improves Downstream Analyses. Res Comput Mol Biol 10812:280-282
Shajii, Ariya; Numanagić, Ibrahim; Whelan, Christopher et al. (2018) Statistical Binning for Barcoded Reads Improves Downstream Analyses. Cell Syst 7:219-226.e5
Lin, Yen-Yi; Gawronski, Alexander; Hach, Faraz et al. (2018) Computational identification of micro-structural variations and their proteogenomic consequences in cancer. Bioinformatics 34:1672-1681
Ginart, Antonio A; Hui, Joseph; Zhu, Kaiyuan et al. (2018) Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 9:566
Cho, Hyunghoon; Wu, David J; Berger, Bonnie (2018) Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36:547-551
Numanagić, Ibrahim; Malikić, Salem; Ford, Michael et al. (2018) Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. Nat Commun 9:828
Kalina, Jennifer L; Neilson, David S; Lin, Yen-Yi et al. (2017) Mutational Analysis of Gene Fusions Predicts Novel MHC Class I-Restricted T-Cell Epitopes and Immune Signatures in a Subset of Prostate Cancer. Clin Cancer Res 23:7596-7607
McPherson, Andrew W; Roth, Andrew; Ha, Gavin et al. (2017) ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 18:140
Shajii, Ariya; Yorukoglu, Deniz; Yu, Yun William et al. (2016) Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics 32:i538-i544
Yorukoglu, Deniz; Yu, Yun William; Peng, Jian et al. (2016) Compressive mapping for next-generation sequencing. Nat Biotechnol 34:374-6

Showing the most recent 10 out of 30 publications