Compressive genomics for large omics data sets: Algorithms, applications & tools

Berger, Bonnie

Abstract

High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer size threatens to make their interpretation computationally infeasible. The goal of this project is to design and develop innovative compression-based algorithmic techniques and publicly available software for large-scale genomic sequence data sets.

The key underlying observation is that most genomes currently being sequenced are highly similar to genomes that have already been collected. Thus, the amount of new sequence information is growing much more slowly than the total size of genomic sequence data sets. In very recent work, we provided a proof of concept that this redundancy can be exploited by compressing sequence data in a way that allows direct computation on the compressed data, a methodological paradigm we term compressive genomics.

In this proposal, we broaden the framework of compressive genomics to several additional application areas in which algorithmic advances are urgently needed to keep pace with the growth in both genomic and protein sequencing data. In particular, we will build a novel, comprehensive framework for compressive representation and highly efficient downstream analysis of large-scale next-generation sequencing (NGS) data sets; this will significantly advance the state of the art and scale better than existing algorithms as the volume of genomic data grows, thus meeting the challenge of the expected future acceleration of sequencing technologies. Additionally, we will develop advanced, compressively accelerated algorithms and software for specific applications of current interest in bioinformatics and apply them to real large-scale 'omics' data sets to accelerate data analytics and lead to novel biological discoveries. Namely, we will collaborate with the Kohane lab on the analysis of high-throughput gene expression and NGS data sets from patients with neurodevelopmental disorders, including Autism Spectrum Disorder and Parkinson's; the broad, long-term goal is to apply our compressive approach to such massive data sets to elucidate the still obscure molecular landscape of these diseases.

Understanding massive 'omics' data from patients will empower both rational, targeted drug design and more intelligent disease management, yet the sheer size of these data threatens to make the resulting problems computationally infeasible. Here, we develop computational methods and tools that will fundamentally advance the state of the art in storage, retrieval, and analysis of these rapidly expanding data sets.
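To make the idea of "direct computation on the compressed data" concrete, here is a minimal toy sketch in Python. It is not the project's software or algorithm. It assumes equal-length sequences, Hamming distance as the similarity measure, and a greedy clustering step; the function names (`compress`, `search`) and the parameters (`radius`, `max_dist`) are hypothetical and chosen only for illustration. Near-duplicate sequences are stored as short lists of differences from a representative, and a query is compared against representatives first, so that whole clusters can be skipped and member distances can be computed from the stored differences alone.

```python
from typing import Dict, List, Tuple

Diff = List[Tuple[int, str]]  # (position, substituted base) relative to a representative


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def compress(seqs: List[str], radius: int) -> Dict[str, List[Diff]]:
    """Greedily group sequences around representatives (toy scheme, an assumption).

    A sequence within `radius` mismatches of an existing representative is
    stored only as its differences from that representative; otherwise it
    becomes a new representative, recorded with an empty diff for itself.
    """
    clusters: Dict[str, List[Diff]] = {}
    for s in seqs:
        for rep in clusters:
            if hamming(s, rep) <= radius:
                diff = [(i, c) for i, (c, r) in enumerate(zip(s, rep)) if c != r]
                clusters[rep].append(diff)
                break
        else:
            clusters[s] = [[]]
    return clusters


def search(query: str, clusters: Dict[str, List[Diff]], radius: int, max_dist: int) -> List[str]:
    """Return every stored sequence within `max_dist` mismatches of `query`."""
    hits: List[str] = []
    for rep, members in clusters.items():
        d_rep = hamming(query, rep)
        # Coarse step: every member is within `radius` of rep, so by the
        # triangle inequality no member can be within `max_dist` of the query.
        if d_rep > max_dist + radius:
            continue
        # Fine step: adjust the representative's distance using only the diffs.
        for diff in members:
            d = d_rep
            for pos, base in diff:
                d -= rep[pos] != query[pos]  # mismatch against rep no longer counts
                d += base != query[pos]      # mismatch against the member's base counts
            if d <= max_dist:
                chars = list(rep)
                for pos, base in diff:
                    chars[pos] = base
                hits.append("".join(chars))  # decompress only the actual hits
    return hits


seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "TTTTGGGG", "TTTTGGGC"]
clusters = compress(seqs, radius=1)
print(search("ACGTACGA", clusters, radius=1, max_dist=1))
```

In this toy data set the five sequences collapse into two clusters, and the query is compared in full against only the two representatives. The search cost therefore tracks the amount of non-redundant sequence rather than the total number of stored sequences, which is the scaling behavior the abstract describes.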

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM108348-03
Application #: 8849927
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Ravichandran, Veerasamy

Project Start: 2013-09-05
Project End: 2016-05-31
Budget Start: 2015-06-01
Budget End: 2016-05-31
Support Year: 3
Fiscal Year: 2015
Total Cost
Indirect Cost

Institution

Name: Massachusetts Institute of Technology
Department
Type
DUNS #: 001425594

City: Cambridge
State: MA
Country: United States
Zip Code

Related projects


NIH 2019 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools Berger, Bonnie / Massachusetts Institute of Technology
NIH 2018 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools Berger, Bonnie / Massachusetts Institute of Technology
NIH 2017 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools Berger, Bonnie / Massachusetts Institute of Technology
NIH 2016 R01 GM	Compressive Genomics for Large Omics Data Sets: Algorithms, Applications and Tools Berger, Bonnie / Massachusetts Institute of Technology	$372,014
NIH 2015 R01 GM	Compressive genomics for large omics data sets: Algorithms, applications & tools Berger, Bonnie / Massachusetts Institute of Technology
NIH 2014 R01 GM	Compressive genomics for large omics data sets: Algorithms, applications & tools Berger, Bonnie / Massachusetts Institute of Technology	$213,200
NIH 2013 R01 GM	Compressive genomics for large omics data sets: Algorithms, applications & tools Berger, Bonnie / Massachusetts Institute of Technology	$217,893

Publications

Lin, Yen-Yi; Gawronski, Alexander; Hach, Faraz et al. (2018) Computational identification of micro-structural variations and their proteogenomic consequences in cancer. Bioinformatics 34:1672-1681

Ginart, Antonio A; Hui, Joseph; Zhu, Kaiyuan et al. (2018) Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 9:566

Cho, Hyunghoon; Wu, David J; Berger, Bonnie (2018) Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36:547-551

Numanagić, Ibrahim; Malikić, Salem; Ford, Michael et al. (2018) Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. Nat Commun 9:828

Shajii, Ariya; Numanagić, Ibrahim; Berger, Bonnie (2018) Latent Variable Model for Aligning Barcoded Short-Reads Improves Downstream Analyses. Res Comput Mol Biol 10812:280-282

Shajii, Ariya; Numanagić, Ibrahim; Whelan, Christopher et al. (2018) Statistical Binning for Barcoded Reads Improves Downstream Analyses. Cell Syst 7:219-226.e5

Kalina, Jennifer L; Neilson, David S; Lin, Yen-Yi et al. (2017) Mutational Analysis of Gene Fusions Predicts Novel MHC Class I-Restricted T-Cell Epitopes and Immune Signatures in a Subset of Prostate Cancer. Clin Cancer Res 23:7596-7607

McPherson, Andrew W; Roth, Andrew; Ha, Gavin et al. (2017) ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 18:140

Shajii, Ariya; Yorukoglu, Deniz; Yu, Yun William et al. (2016) Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics 32:i538-i544

Yorukoglu, Deniz; Yu, Yun William; Peng, Jian et al. (2016) Compressive mapping for next-generation sequencing. Nat Biotechnol 34:374-376

Showing the most recent 10 out of 30 publications
