Microbes play important roles in everyday life. However, most microbial species cannot be readily studied because they cannot be isolated from their environments, i.e., from the microbial communities in which they live. Metagenomics, which analyzes mixed DNA sequences extracted directly from microbial communities, has emerged as an important field for studying microbes and microbial communities. "Community structure identification" is one of the core problems in metagenomics: the goal is to identify the species present in a specific microbial community, and their relative abundances, from the extracted mixed DNA sequences.
Because public databases are taxonomically biased towards culturable species, and because next-generation sequencing platforms produce short DNA sequences, there is an urgent need for computational methods that can effectively address the "community structure identification" problem. To minimize the effect of the taxonomic bias and the short read length, the proposed research will create a statistical framework that bins reads likely to come from the same species based on their k-tuple frequencies. Within this framework, a series of algorithms will be designed to infer community structures by integrating large-scale genomic sequences and their annotations. The methods will be evaluated on both simulated and experimental data. An accompanying software package will also be developed and released free of charge to the research community. The proposed methods and tools will help lay the foundation for further study of microbes and significantly advance the scientific understanding of microbes and microbial communities.
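As an illustration of what binning reads by k-tuple frequencies can look like, the following is a minimal Python sketch. It is not the statistical framework proposed here, whose details are not given in this summary. The function names (ktuple_frequencies, euclidean, greedy_bin), the choice of k = 2, the Euclidean distance, the greedy assignment rule, and the threshold of 0.3 are all assumptions made for the example.

```python
from collections import Counter
from itertools import product
from math import sqrt


def ktuple_frequencies(read, k=2):
    """Return the relative frequency of each of the 4**k k-tuples over A, C, G, T."""
    read = read.upper()
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    ktuples = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-tuples made of A, C, G, T are counted; ambiguous bases are ignored.
    total = sum(counts[t] for t in ktuples)
    if total == 0:
        return [0.0] * len(ktuples)
    return [counts[t] / total for t in ktuples]


def euclidean(u, v):
    """Euclidean distance between two frequency profiles (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def greedy_bin(reads, k=2, threshold=0.3):
    """Hypothetical greedy binning: put each read in the first bin whose
    representative profile is within `threshold`, else open a new bin."""
    bins = []  # each entry: (representative profile, list of read indices)
    for idx, read in enumerate(reads):
        profile = ktuple_frequencies(read, k)
        for representative, members in bins:
            if euclidean(representative, profile) <= threshold:
                members.append(idx)
                break
        else:
            bins.append((profile, [idx]))
    return [members for _, members in bins]


if __name__ == "__main__":
    toy_reads = ["ATATATATATAT", "TATATATATATA", "GCGCGCGCGCGC", "CGCGCGCGCGCG"]
    print(greedy_bin(toy_reads))
```

In this toy input the two AT-rich reads share nearly the same dinucleotide profile, as do the two GC-rich reads, so they fall into two separate bins. Real reads are far noisier, which is why a statistical treatment of k-tuple frequencies, rather than a fixed distance cutoff, is the subject of the proposed work.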
This project will provide research experience for minority high school students and undergraduates. It will also educate undergraduate and graduate students in metagenomics through curriculum development, seminars, mentoring activities, and annual research symposia. In addition, the project will disseminate its research results through publications, conference presentations, freely available software, and other channels. Finally, by attacking a core problem in metagenomics and providing a variety of accurate inferences, this research will greatly advance current knowledge of microbes and microbial communities.