Metagenomics is a powerful molecular approach in which an environmental sample containing genetic material from an entire community of organisms is analyzed as a whole, without requiring individual organisms to be isolated or cultured in the laboratory. This is highly relevant to medical applications, because many microorganisms, both harmful and beneficial to humans, operate as close-knit communities. Metagenomic analysis of the resulting DNA sequences is complicated by a number of factors. First, the nucleotide sequences are available only in the form of "reads," which, because of their short length, may be difficult to assign to species. Second, environmental samples contain many organisms that are neither known nor previously characterized. Current metagenomic analysis programs classify sequence reads on the basis of crude measures of similarity between the unknown sequence and those currently available in the databanks. A shortcoming of the current approach is that the assignments are not based on rigorous statistical considerations: the assignment of unknown sequences to existing taxa is largely heuristic, and it is not readily possible to associate a probability with an assignment using advanced evolutionary genomics tools. The goal of the proposed research is to design and implement a new approach to metagenomic analysis based on statistical phylogenetics principles in order to generate more accurate and informative assignments. The new approach will use existing, carefully assembled multiple sequence alignments that are now publicly available, rather than the raw data in sequence banks used by current systems. The accuracy of the new method will be tested on many empirical and simulated data sets and compared with that achieved by current state-of-the-art methods.
Successful completion of this project will yield insights into factors responsible for successes and failures of the proposed and the existing methods, and it has a high likelihood of producing a useful method for evolutionary bioinformatics of metagenomic data.
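
To make concrete what "associating a probability with an assignment" could mean, the following minimal Python sketch converts per-taxon log-likelihoods for a single read into posterior probabilities. It is an illustration only and not the method proposed in this project: the function name `assignment_posteriors`, the taxon labels, the log-likelihood values, and the default uniform prior are all hypothetical assumptions, and a real analysis would obtain the likelihoods from a phylogenetic model applied to a reference alignment.

```python
import math


def assignment_posteriors(log_likelihoods, priors=None):
    """Turn per-taxon log-likelihoods for one read into posterior probabilities.

    log_likelihoods: dict mapping taxon name -> log P(read | taxon).
    priors: optional dict mapping taxon name -> prior probability;
            a uniform prior is assumed when omitted (an assumption of this sketch).
    """
    taxa = list(log_likelihoods)
    if priors is None:
        priors = {t: 1.0 / len(taxa) for t in taxa}

    # Work in log space and subtract the maximum to avoid numerical underflow.
    log_post = {t: log_likelihoods[t] + math.log(priors[t]) for t in taxa}
    peak = max(log_post.values())
    weights = {t: math.exp(v - peak) for t, v in log_post.items()}

    # Normalize so the probabilities over candidate taxa sum to one.
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Hypothetical log-likelihoods of one read under three candidate taxa.
example = {"taxon_A": -1203.4, "taxon_B": -1205.1, "taxon_C": -1215.9}
posteriors = assignment_posteriors(example)
best = max(posteriors, key=posteriors.get)

for taxon, prob in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{taxon}: {prob:.4f}")
```

Unlike a raw similarity score, the normalized values can be compared across reads and thresholded, so a read whose best candidate has low posterior probability can be flagged as ambiguous rather than forced into a taxon.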

Public Health Relevance

Metagenomic analysis has emerged as a powerful tool for analyzing the genetic, and thus organismal, composition of the microbial communities that inhabit our planet and our bodies. The proposed statistical and computational research will result in an evolutionary phylogenetic framework for advanced analysis of metagenomic data, which will improve the application of metagenomics to understanding microorganisms, both harmful and beneficial to humans.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Exploratory/Developmental Grants (R21)
Study Section: Special Emphasis Panel (ZRG1-BST-F (02))
Program Officer: Bonazzi, Vivien
Arizona State University-Tempe Campus
Organized Research Units
United States
Filipski, Alan; Murillo, Oscar; Freydenzon, Anna et al. (2014) Prospects for building large timetrees using molecular data with incomplete gene coverage among species. Mol Biol Evol 31:2542-50
Kumar, Sudhir; Filipski, Alan J; Battistuzzi, Fabia U et al. (2012) Statistics and truth in phylogenomics. Mol Biol Evol 29:457-72