Understanding the evolutionary relationships between organisms is fundamental to a wide variety of problems in biology. This project investigates and develops new methods for inferring species relationships from genetic data, using probabilistic models of gene trees conditional on a species tree. Its main goals are (1) to advance the mathematical understanding of these models, with a view toward species tree inference; (2) to develop improved methods for species tree inference by considering new and underutilized data types derived from gene trees, including clades, splits, unrooted gene trees, and ranked gene trees; (3) to validate theoretical, computational, and statistical properties of these new methods; and (4) to produce software for use by empirical biologists. The project will identify gene tree summary statistics on which accurate inference can be based, and will employ these statistics to develop practical methods that can be used in the presence of missing data and under violations of model assumptions. The mathematical, statistical, and computational properties of both new and current methods will be studied to enable comparisons that can guide empirical applications.
The model-based, probabilistic approach of this work provides a foundation for enhancing species tree inference from gene tree samples, and thus from genetic sequence data. The project addresses a promising methodological middle ground between computationally intensive full likelihood and Bayesian analyses, which are often infeasible for genomic-scale data sets, and tractable combinatorial methods, which often lack desirable statistical properties. The work will advance phylogenetic analysis by deepening knowledge of probabilistic models of gene tree discordance through analysis of the behavior of summary statistics. It will improve the practice of species tree inference by introducing new statistically consistent approaches and by developing theoretical and experimental understanding of the robustness of methods. Further, its use of mathematical techniques from probability, combinatorics, and algebraic statistics, as well as computational experiments employing simulation, will enhance mathematical evolutionary biology more generally.
Inference of species relationships from genetic data is an essential component of biomedical science, for such disparate purposes as providing evolutionary insights, comparing model organisms, and understanding variation in pathogen strains. This project addresses the challenges of estimating species trees from large genomic data sets by providing new theoretical and practical tools.