By a widely cited estimate, humans carry ten times more bacterial cells than human cells, and a hundred times more bacterial genes than genes in the inherited human genome. These microbes also hold clues to maintaining health and preventing disease. Over the last decade, cultivation-independent metagenomics, in which all microorganisms in a sample are sequenced together directly, has been applied intensively to understand the impact of microbes on human health. A new generation of sequencing technologies has accelerated this research, but it has also left a vast amount of metagenomic sequencing data to be analyzed. Software and high-performance computing systems that could speed up this analysis are still lacking. The PI proposes to develop novel computational algorithms and cloud computing software to decipher terabytes of metagenomic sequencing data for studying the human microbiome. Experience gained from this work will accelerate development of the proposed tools and improve understanding of the microbial ecosystem in our bodies. Ultimately, this may contribute to better diagnosis, prevention, and treatment of disease. Furthermore, the proposed cloud computing algorithms and techniques could be adapted to many other computationally demanding applications. A key component of the proposal is offering graduate and undergraduate computer science students a unique opportunity for interdisciplinary research, designing algorithms and software to solve biological problems.
Novel computational algorithms and a cloud computing software tool are proposed for analyzing large-scale metagenomic sequencing data to study the human microbiome. The project will build on Apache Spark, an open-source cluster computing framework for large-scale data processing that supports a rich set of high-level tools, including scalable machine learning and graph processing libraries. The primary novelty is a cloud-scalable de novo assembler combined with the ability to compare assembled sequences against existing reference genomes using Spark libraries. This new approach will speed up the identification of novel genomes and of the microbial composition of large metagenomic datasets. By contrast, most existing metagenomic analysis methods run de novo sequence assembly and taxonomic classification against many existing reference genomes as separate steps. Key technical innovations of the proposed work are (i) cloud computing algorithms enabling a fast and scalable metagenome assembler, (ii) passing assembled sequences directly to taxonomic classification to dramatically reduce computation time, and (iii) a cloud container package allowing researchers to analyze metagenomic data easily and cheaply. Providing the cloud container package with a simple Web interface will enable researchers to analyze their large-scale metagenomic sequence data readily and quickly for human health, biosurveillance, and pan-genomic analysis of microbiota.
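As a rough illustration of the kind of computation involved, the sketch below counts k-mers across sequencing reads, derives de Bruijn graph edges from the frequent k-mers, and scores a contig against a reference by shared k-mers. It is not the proposed assembler or classifier: it is a single-machine, pure-Python stand-in for the map and reduce-by-key pattern that a Spark job would distribute across a cluster (for example with `flatMap` and `reduceByKey` on an RDD). All function names, the toy reads, and the choice of k are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def kmers(read, k):
    """Map step: emit every length-k substring of one read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count k-mers over all reads (a flat map followed by reduce-by-key)."""
    return Counter(chain.from_iterable(kmers(r, k) for r in reads))


def de_bruijn_edges(kmer_counts, min_count=1):
    """Turn each k-mer seen at least min_count times into a
    (prefix, suffix) edge between two (k-1)-mers."""
    return sorted(
        (kmer[:-1], kmer[1:])
        for kmer, n in kmer_counts.items()
        if n >= min_count
    )


def shared_kmer_fraction(contig, reference, k):
    """Fraction of the contig's distinct k-mers that also occur in the
    reference; a crude similarity score, not a real taxonomy classifier."""
    contig_set = set(kmers(contig, k))
    if not contig_set:
        return 0.0
    return len(contig_set & set(kmers(reference, k))) / len(contig_set)


# Hypothetical toy reads; real inputs would be millions of reads from FASTQ files.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
counts = count_kmers(reads, k=4)
edges = de_bruijn_edges(counts, min_count=2)
score = shared_kmer_fraction("ACGTAC", "TTACGTACGG", k=4)
```

In a distributed setting, the per-read `kmers` call is the part that parallelizes trivially, while the counting step requires shuffling identical k-mers to the same worker; that shuffle is typically where a scalable assembler spends much of its effort.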