High-throughput experimental technologies are generating increasingly massive and complex genomic sequence data sets. While these data hold the promise of uncovering entirely new biology, their sheer scale threatens to make their interpretation computationally infeasible. The continued goal of this project is to design and develop innovative compression-based algorithmic techniques for efficiently processing massive biological data. We will branch out beyond compressive search to address the imminent need to securely store and process large-scale genomic data in the cloud, as well as to gain insights from massive metagenomic data.

The key underlying observation is that genomic data are highly structured, exhibiting a high degree of self-similarity. In the previous grant period, we exploited this high redundancy and low fractal dimension to enable scalable compressive storage of, and accelerated search over, sequence data as well as other biological data types relevant to structural bioinformatics and chemogenomics. In this renewal, we will continue to capitalize on the structure (i.e., compressibility) of genomic data to: (i) overcome privacy concerns that arise in sharing sensitive human data (e.g., on the cloud); (ii) address new challenges, beyond search, posed by metagenomic data; and (iii) widen the adoption of our previous and newly proposed compressive algorithms in industry, research, and clinical settings.

We will demonstrate the utility of our compressive techniques for the characterization of human genomic and metagenomic variation. We will collaborate with co-I Sahinalp's lab (Indiana University, Bloomington) to develop and apply these tools to high-throughput data sets spanning autism spectrum disorder (with Isaac Kohane and Evan Eichler), cancer (with PCAWG, the Pan-Cancer Analysis of Whole Genomes), the microbiome (with Eric Alm and Jian Peng), and human variation analysis (GATK, with Eric Lander and Eric Banks). The broad, long-term goal is to apply our compressive approach to massive biological data sets to elucidate the still poorly understood molecular landscape of disease. Successful completion of these aims will yield computational methods and tools that significantly increase our ability to securely store, access, and analyze massive data sets, and will reveal fundamental aspects of genetic variation as well as testable hypotheses for experimental investigation. Not only will all developed software be made publicly available, but as part of our integration aim, we will also ensure that the research community can make use of our innovations with minimal effort. Through our research collaborations, we will both build these tools and demonstrate their relevance to the characterization of human health and disease.
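To make the idea of compressive acceleration concrete, the sketch below illustrates one generic coarse-to-fine strategy over a redundant sequence collection: near-identical sequences are collapsed into clusters, a query is first compared against cluster representatives, and only promising clusters are expanded. This is a minimal, hypothetical illustration of the general principle (all function names, thresholds, and the toy similarity measure are assumptions for exposition), not the project's actual software or algorithms.

    # Illustrative sketch only: toy coarse-to-fine "compressive" search over a
    # redundant sequence collection. Names and thresholds are hypothetical.
    from difflib import SequenceMatcher

    def similarity(a, b):
        # Crude similarity in [0, 1]; a stand-in for a real sequence aligner.
        return SequenceMatcher(None, a, b).ratio()

    def build_clusters(sequences, link_threshold=0.9):
        # Greedy clustering: each cluster keeps one representative plus members.
        # Redundant (near-duplicate) sequences collapse onto one representative,
        # so the coarse index is much smaller than the raw collection.
        clusters = []  # list of (representative, [member sequences])
        for seq in sequences:
            for rep, members in clusters:
                if similarity(seq, rep) >= link_threshold:
                    members.append(seq)
                    break
            else:
                clusters.append((seq, [seq]))
        return clusters

    def compressive_search(query, clusters, coarse_threshold=0.5, fine_threshold=0.8):
        # Coarse pass over representatives; fine pass only within matching clusters.
        hits = []
        for rep, members in clusters:
            if similarity(query, rep) >= coarse_threshold:
                hits.extend(s for s in members if similarity(query, s) >= fine_threshold)
        return hits

    if __name__ == "__main__":
        reads = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC", "ACGTACGAAC"]
        print(compressive_search("ACGTACGTAA", build_clusters(reads)))

Because most of the work happens over the (much smaller) set of representatives, the search cost scales with the compressed size of the data rather than its raw size, which is the essence of the compressive approach described above.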
Understanding massive genomic data from patients will enable both the development of microbiome therapeutics and new insights into human disease variation, yet this task poses major scalability and privacy challenges. Here, we develop novel computational methods and tools that will fundamentally advance the state of the art in efficient and secure storage, access, and analysis of these rapidly expanding data sets.