High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The goal of this project is to design and develop innovative compression-based algorithmic techniques and publicly available software for large-scale genomic sequence data sets. The key underlying observation is that most genomes currently being sequenced share much similarity with genomes that have already been collected. Thus, the amount of new sequence information is growing much more slowly than the total size of genomic sequence data sets. In very recent work, we have provided a proof of concept that this redundancy can be exploited by compressing sequence data in such a way as to allow direct computation on the compressed data, a methodological paradigm we term "compressive genomics." In this proposal we broaden the framework of compressive genomics to several additional application areas in which algorithmic advances are urgently needed in order to keep pace with the growth in both genomic and protein sequencing data. In particular, we will build a novel comprehensive framework for compressive representation and highly efficient downstream analysis of large-scale next-generation sequencing (NGS) data sets; this will significantly advance the state of the art and scale better than existing algorithms as the volume of genomic data grows, thus meeting the challenge of the expected future acceleration of sequencing technologies. Additionally, we will develop advanced, compressively accelerated algorithms and software for specific applications of current interest in bioinformatics and apply them to real large-scale 'omics' data sets to accelerate data analytics and lead to novel biological discoveries. Namely, we will collaborate with the Kohane lab on analysis of high-throughput gene expression and NGS data sets from patients with neurodevelopmental disorders, including Autism Spectrum Disorder and Parkinson's; the broad, long-term goal is to apply our compressive approach to such massive data sets to elucidate the still obscure molecular landscape of these diseases.

Understanding massive 'omics' data from patients will empower both rational, targeted drug design and more intelligent disease management, yet their sheer enormity threatens to make the arising problems computationally infeasible. Here, we develop computational methods and tools that will fundamentally advance the state of the art in storage, retrieval and analysis of these rapidly expanding data sets.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM108348-02
Application #: 8730209
Study Section: Special Emphasis Panel (ZRG1-BST-N (52))
Program Officer: Wu, Mary Ann
Project Start: 2013-09-05
Project End: 2016-05-31
Budget Start: 2014-06-01
Budget End: 2015-05-31
Support Year: 2
Fiscal Year: 2014
Total Cost: $213,200
Indirect Cost: $62,667

Name: Massachusetts Institute of Technology
Department:
Type:
DUNS #: 001425594
City: Cambridge
State: MA
Country: United States
Zip Code: 02139

Tucker, George; Price, Alkes L; Berger, Bonnie (2014) Improving the power of GWAS and avoiding confounding from population stratification with PC-Select. Genetics 197:1045-9
Berger, Emily; Yorukoglu, Deniz; Peng, Jian et al. (2014) HapTree: a novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol 10:e1003502
Lipson, Mark; Loh, Po-Ru; Patterson, Nick et al. (2014) Reconstructing Austronesian population history in Island Southeast Asia. Nat Commun 5:4689
Pickrell, Joseph K; Patterson, Nick; Loh, Po-Ru et al. (2014) Ancient west Eurasian ancestry in southern and eastern Africa. Proc Natl Acad Sci U S A 111:2632-7
Chindelevitch, Leonid; Trigg, Jason; Regev, Aviv et al. (2014) An exact arithmetic toolbox for a consistent and reproducible structural analysis of metabolic network models. Nat Commun 5:4893