The estimation of the evolutionary history of a collection of organisms based on the information contained in their DNA sequences is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to important computational challenges in the estimation of these phylogenetic relationships. Among these challenges is the estimation of the evolutionary history for a group of species based on DNA sequence information from several distinct genes sampled throughout the genome. This research is focused on the development of computationally efficient methods for estimating the evolutionary history when the number of species under consideration is very large (i.e., hundreds to thousands). This is accomplished by considering collections of three species at a time, and using properties of the estimated evolutionary history for groups of three to infer the overall evolutionary history. Properties and performance of the method will be evaluated theoretically as well as with both simulated and empirical data sets. This work has numerous practical applications, such as the study of the evolutionary relationships among human populations.
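To give a rough sense of the scale involved in working with three species at a time, the following minimal sketch counts and enumerates the three-species subsets for a given number of species. The function names and species labels are hypothetical and chosen only for illustration. The abstract does not say whether the method examines every subset or only a sample of them.

```python
from itertools import combinations
from math import comb


def count_triplets(n_species: int) -> int:
    """Number of distinct three-species subsets among n_species species."""
    return comb(n_species, 3)


def enumerate_triplets(species):
    """List every unordered three-species subset (hypothetical helper)."""
    return list(combinations(species, 3))


if __name__ == "__main__":
    # "Hundreds to thousands" of species, as mentioned in the abstract.
    for n in (100, 1000):
        print(n, "species ->", count_triplets(n), "triplets")

    # Small example with placeholder species labels.
    print(enumerate_triplets(["A", "B", "C", "D"]))
```

The number of subsets grows cubically with the number of species, which is one reason the per-triplet computation would need to be cheap for data sets of this size.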

Though the amount of genomic data available for inferring phylogenetic species trees has increased rapidly within the last 10 years, few methods have been developed to efficiently estimate species trees for data sets consisting of hundreds or thousands of species. A fast approximation to the maximum likelihood estimate (MLE) that retains desirable statistical properties, such as consistency and asymptotic efficiency, is proposed. Results from preliminary work suggest that this approach will be significantly faster than existing likelihood and Bayesian approaches, while also being highly accurate. The method can be applied to a range of data types, including allele frequency data arising under a Brownian motion model along the phylogeny and single nucleotide polymorphism (SNP) data arising from the coalescent model. A software package will be developed to implement the methodology. The project will also support one PhD student, who will contribute to the development and implementation of the methodology.
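As an illustration of how a three-species group can carry information about the tree under a Brownian motion model, the sketch below simulates a continuous trait along a rooted tree ((A,B),C) and picks the pair with the largest sample covariance as the sister pair. This is not the proposed approximation to the MLE, whose details are not given here. The branch lengths, function names, and covariance-based rule are assumptions made for the example, and the simulation ignores the fact that real allele frequencies are bounded between 0 and 1.

```python
import random
from itertools import combinations


def simulate_triplet(n_loci, t_shared=1.0, t_tip=0.5, seed=0):
    """Simulate independent loci under Brownian motion on the tree ((A,B),C).

    Assumed setup: root value 0, variance accrues in proportion to branch length.
    A and B share an internal branch of length t_shared; C does not.
    """
    rng = random.Random(seed)
    data = {"A": [], "B": [], "C": []}
    for _ in range(n_loci):
        ancestor = rng.gauss(0.0, t_shared ** 0.5)  # common ancestor of A and B
        data["A"].append(ancestor + rng.gauss(0.0, t_tip ** 0.5))
        data["B"].append(ancestor + rng.gauss(0.0, t_tip ** 0.5))
        data["C"].append(rng.gauss(0.0, (t_shared + t_tip) ** 0.5))
    return data


def covariance(x, y):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def infer_sister_pair(data):
    """Return the pair of species with the largest sample covariance."""
    pairs = combinations(sorted(data), 2)
    return max(pairs, key=lambda p: covariance(data[p[0]], data[p[1]]))


if __name__ == "__main__":
    simulated = simulate_triplet(n_loci=2000)
    print("Inferred sister pair:", infer_sister_pair(simulated))
```

Under this model the covariance between two species equals the length of the branch they share, so the pair that diverged most recently tends to show the largest covariance when enough loci are available.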

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1832303
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2017-07-01
Budget End: 2021-07-31
Support Year:
Fiscal Year: 2018
Total Cost: $138,594
Indirect Cost:
Name: Joan and Sanford I. Weill Medical College of Cornell University
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10065