The University of Florida is awarded a grant to develop an integrated suite of computational tools and statistical methods that allow researchers to analyze tens of millions of 16S rRNA sequences for microbial community analysis and to extract biologically relevant patterns from massive sequence data. These tools will be made available to the research community as a web application. Microbes play an essential role in processes as diverse as human health and the biogeochemical activities critical to life in all environments on Earth. Complex microbial communities, however, remain poorly characterized. Currently available pyrosequencing technologies can determine the nucleotide sequences of millions of individual 16S rRNA molecules in a matter of hours, opening new windows onto the hidden microbial world. However, such large amounts of data overwhelm existing computational resources and analytic methods. An interdisciplinary research plan will be used to develop algorithms that overcome the current computational hurdles of large-scale 16S rRNA-based analysis of microbial communities. Advanced computational techniques will be used, including parallel computing, online learning, graphical modeling, supervised and unsupervised learning, and dimensionality reduction. The specific aims are: (1) to develop computational algorithms for large-scale taxonomy-independent analysis; (2) to develop a collection of statistical and computational methods for comparative community analysis, including discriminant analysis, topology analysis and microbial network analysis; and (3) to establish a web application based on the proposed algorithms to provide researchers with a complete package of tools for comparative microbial community analysis.
The analytical approaches developed in the project will enable researchers to estimate microbial community diversity, derive quantitative disease-associated microbial profiles, characterize environment-microbe and microbe-microbe interactions, and identify and quantify sequences from unclassified species. Many of the analytic methods that will be developed have not traditionally been used to analyze microbial communities. Hence, this work represents a major transformation of the bioinformatics methodology used for investigating microbial communities, and has the potential to significantly advance discovery and understanding of the hidden microbial world.
The results from this study will be disseminated through publications, web applications, workshops and open-source projects. Multiple impacts are anticipated. The development of new computational approaches capable of efficiently handling the tens of millions of sequences currently generated by third-generation sequencers can greatly improve the utility of existing pipelines. Open-source projects will invite researchers from other fields, such as mathematics and statistics, to join this project. The close interactions between computer scientists and biologists that we propose to develop will create new teaching and training opportunities, and spark new algorithmic research with direct utility to biologists. Currently, there is a shortage of researchers with a deep understanding of both computer science and molecular biology. This project will provide two graduate students and a postdoctoral fellow with an excellent opportunity to receive intensive training in both areas. Software and results of this project will be available at http://plaza.ufl.edu/sunyijun/DProject.htm