Model-based phylogenetic analyses (maximum likelihood, Bayesian, and distance-based methods) rely on complex models of the evolution of genetic data whose statistical properties are not well understood. In particular, phylogenetic software commonly models rate variation across sites with a discretized Gamma distribution, yet important statistical properties of that model, such as identifiability and how many rate classes can or must be used, are unknown. We propose to determine the statistical properties of such models. Moreover, even the most sophisticated models fail to capture features of many real data sets analyzed in phylogenetics: for instance, while the general time reversible model allows for arbitrary base frequencies, it requires that all species under study share the same base frequencies. The natural non-parametric alternative is the method of maximum parsimony, which suffers from the phenomenon of long branch attraction: when data are generated under some model of genetic evolution on certain types of trees and then analyzed under parsimony, the method returns an incorrect tree with a probability that does not tend to zero as the amount of data grows without bound. Thus, it is often said that maximum parsimony is not a statistically consistent method for phylogeny reconstruction. However, this criticism can potentially apply to all phylogenetic reconstruction methods, including model-based methods, whenever the data are not generated under the model used to analyze them. We propose to determine the model conditions under which parsimony is a consistent method for phylogenetic estimation. The goal of both projects is to provide a solid mathematical foundation for phylogenetic reconstruction methods.
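To make the long branch attraction phenomenon concrete, the following minimal sketch computes the exact expected frequency of each parsimony-informative site pattern on a four-species tree. It is an illustration only and makes several assumptions that do not come from the proposal: a two-state symmetric substitution model, a true tree ((A,B),(C,D)) in which A and C sit on long branches, a change probability of 0.05 on the short and internal branches, and the hypothetical function names `transition`, `split_support`, and `parsimony_limit`. With four species, parsimony in the limit of unlimited data selects the split whose informative pattern is most frequent, so comparing these frequencies shows whether the method converges to the true tree.

```python
from itertools import product


def transition(x, y, p):
    """Two-state symmetric model (an assumption): the state flips
    along a branch with probability p and is kept with probability 1 - p."""
    return p if x != y else 1.0 - p


def split_support(p_long, p_short):
    """Exact expected frequency of each parsimony-informative site pattern
    on the true tree ((A,B),(C,D)). Branches to A and C change with
    probability p_long; branches to B and D and the internal branch
    change with probability p_short."""
    support = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    for a, b, c, d in product((0, 1), repeat=4):
        # Sum over the unobserved states u and v at the two internal nodes;
        # the factor 0.5 is the uniform distribution of the state at u.
        prob = sum(
            0.5
            * transition(u, a, p_long) * transition(u, b, p_short)
            * transition(u, v, p_short)
            * transition(v, c, p_long) * transition(v, d, p_short)
            for u, v in product((0, 1), repeat=2)
        )
        if a == b and c == d and a != c:
            support["AB|CD"] += prob  # pattern supporting the true tree
        elif a == c and b == d and a != b:
            support["AC|BD"] += prob  # pattern grouping the two long branches
        elif a == d and b == c and a != b:
            support["AD|BC"] += prob
    return support


def parsimony_limit(p_long, p_short):
    """Split that four-taxon parsimony favors as the number of sites grows."""
    support = split_support(p_long, p_short)
    return max(support, key=support.get)


for p_long in (0.1, 0.4):
    print(p_long, parsimony_limit(p_long, 0.05))
# prints:
# 0.1 AB|CD
# 0.4 AC|BD
```

Under these assumed values, parsimony recovers the true split AB|CD when the long branches are moderate (0.1) but converges to the incorrect split AC|BD when they are long (0.4), which is the inconsistency described above.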
Phylogenies are trees describing the evolutionary relationships among species. An accurate description of these relationships can help researchers discover the genetic basis of human diseases and the reasons why viruses and bacteria vary in pathogenicity. This project aims to understand the mathematical properties of the statistical methods used to infer phylogenies from genetic data. These properties indicate which methods are most appropriate for specific kinds of data and whether a given method can be useful on any data at all. This information, in turn, will lead to more accurate phylogenies.