High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer enormity threatens to make their interpretation computationally infeasible. The continued goal of this project is to design and develop innovative compression-based algorithmic techniques for efficiently processing massive biological data. We will branch out beyond compressive search to address the imminent need to securely store and process large-scale genomic data in the cloud, as well as to gain insights from massive metagenomic data. The key underlying observation is that genomic data are highly structured, exhibiting high degrees of self-similarity. In our previous granting period, we exploited their high redundancy and low fractal dimension to enable scalable compressive storage and accelerated search of sequence data as well as other biological data types relevant to structural bioinformatics and chemogenomics. In this renewal, we will continue to capitalize on the structure (i.e., compressibility) of genomic data to: (i) overcome privacy concerns that arise in sharing sensitive human data (e.g., in the cloud); (ii) address new challenges, beyond search, with metagenomic data; and (iii) seek to widen the adoption of the previous and newly proposed compressive algorithms for industry, research, and clinical use. We will demonstrate the utility of our compressive techniques for the characterization of human genomic and metagenomic variation. We will collaborate with co-I Sahinalp's lab (Indiana University, Bloomington) on developing and applying these tools to high-throughput data sets spanning autism spectrum disorder (with Isaac Kohane and Evan Eichler), cancer (with PCAWG, Pan-Cancer Analysis of Whole Genomes), the microbiome (with Eric Alm and Jian Peng), and human variation analysis (GATK, with Eric Lander and Eric Banks).
The broad, long-term goal is to apply our compressive approach to massive biological data sets to elucidate the still-obscure molecular landscape of diseases. Successful completion of these aims will result in computational methods and tools that significantly increase our ability to securely store, access, and analyze massive data sets, and that reveal fundamental aspects of genetic variation as well as testable hypotheses for experimental investigation. All developed software will be made publicly available and, as part of our integration aim, we will ensure that the research community can make use of our innovations with minimal effort. Through our research collaborations, we will both build these tools and demonstrate their relevance to the characterization of human health and disease.

Public Health Relevance

Understanding massive genomic data from patients will empower both the development of microbiome therapeutics and insights into human disease variation, yet this task brings major scalability and privacy challenges. Here, we develop novel computational methods and tools that will fundamentally advance the state of the art in efficient and secure storage, access, and analysis of these rapidly expanding data sets.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM108348-06
Application #
9546755
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
Project Start
2013-09-05
Project End
2020-08-31
Budget Start
2018-09-01
Budget End
2019-08-31
Support Year
6
Fiscal Year
2018
Total Cost
Indirect Cost
Name
Massachusetts Institute of Technology
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
001425594
City
Cambridge
State
MA
Country
United States
Zip Code

Publications
Shajii, Ariya; Numanagić, Ibrahim; Berger, Bonnie (2018) Latent Variable Model for Aligning Barcoded Short-Reads Improves Downstream Analyses. Res Comput Mol Biol 10812:280-282
Shajii, Ariya; Numanagić, Ibrahim; Whelan, Christopher et al. (2018) Statistical Binning for Barcoded Reads Improves Downstream Analyses. Cell Syst 7:219-226.e5
Lin, Yen-Yi; Gawronski, Alexander; Hach, Faraz et al. (2018) Computational identification of micro-structural variations and their proteogenomic consequences in cancer. Bioinformatics 34:1672-1681
Ginart, Antonio A; Hui, Joseph; Zhu, Kaiyuan et al. (2018) Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 9:566
Cho, Hyunghoon; Wu, David J; Berger, Bonnie (2018) Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36:547-551
Numanagić, Ibrahim; Malikić, Salem; Ford, Michael et al. (2018) Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes. Nat Commun 9:828
Kalina, Jennifer L; Neilson, David S; Lin, Yen-Yi et al. (2017) Mutational Analysis of Gene Fusions Predicts Novel MHC Class I-Restricted T-Cell Epitopes and Immune Signatures in a Subset of Prostate Cancer. Clin Cancer Res 23:7596-7607
McPherson, Andrew W; Roth, Andrew; Ha, Gavin et al. (2017) ReMixT: clone-specific genomic structure estimation in cancer. Genome Biol 18:140
Shajii, Ariya; Yorukoglu, Deniz; Yu, Yun William et al. (2016) Fast genotyping of known SNPs through approximate k-mer matching. Bioinformatics 32:i538-i544
Yorukoglu, Deniz; Yu, Yun William; Peng, Jian et al. (2016) Compressive mapping for next-generation sequencing. Nat Biotechnol 34:374-6

Showing the most recent 10 out of 30 publications