High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer size threatens to make their interpretation computationally infeasible. The goal of this project is to design and develop innovative compression-based algorithmic techniques and publicly available software for large-scale genomic sequence data sets. The key underlying observation is that most genomes currently being sequenced are highly similar to genomes that have already been collected. Thus, the amount of new sequence information is growing much more slowly than the total size of genomic sequence data sets. In very recent work, we have provided a proof of concept that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data, a methodological paradigm we term "compressive genomics." In this proposal, we broaden the framework of compressive genomics to several additional application areas in which algorithmic advances are urgently needed to keep pace with the growth in both genomic and protein sequencing data. In particular, we will build a novel, comprehensive framework for compressive representation and highly efficient downstream analysis of large-scale next-generation sequencing (NGS) data sets; this will significantly advance the state of the art and scale better than existing algorithms as the volume of genomic data grows, thus meeting the challenge of the expected future acceleration of sequencing technologies. Additionally, we will develop advanced, compressively accelerated algorithms and software for specific applications of current interest in bioinformatics and apply them to real large-scale 'omics' data sets to accelerate data analytics and enable novel biological discoveries.
Namely, we will collaborate with the Kohane lab on analysis of high-throughput gene expression and NGS data sets from patients with neurodevelopmental disorders, including Autism Spectrum Disorder and Parkinson's; the broad, long-term goal is to apply our compressive approach to such massive data sets to elucidate the still obscure molecular landscape of these diseases. Understanding massive 'omics' data from patients will empower both rational, targeted drug design and more intelligent disease management, yet the sheer size of these data threatens to make the arising problems computationally infeasible. Here, we develop computational methods and tools that will fundamentally advance the state of the art in storage, retrieval, and analysis of these rapidly expanding data sets.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-BST-N (52))
Program Officer: Wu, Mary Ann
Massachusetts Institute of Technology
United States
Tucker, George; Price, Alkes L; Berger, Bonnie (2014) Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics 197:1045-9
Berger, Emily; Yorukoglu, Deniz; Peng, Jian et al. (2014) HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol 10:e1003502
Lipson, Mark; Loh, Po-Ru; Patterson, Nick et al. (2014) Reconstructing Austronesian population history in Island Southeast Asia. Nat Commun 5:4689
Pickrell, Joseph K; Patterson, Nick; Loh, Po-Ru et al. (2014) Ancient west Eurasian ancestry in southern and eastern Africa. Proc Natl Acad Sci U S A 111:2632-7
Chindelevitch, Leonid; Trigg, Jason; Regev, Aviv et al. (2014) An exact arithmetic toolbox for a consistent and reproducible structural analysis of metabolic network models. Nat Commun 5:4893