Estimation of Large Species/Population Trees Using Tree Space

RoyChoudhury, Arindam

Abstract

The estimation of the evolutionary history of a collection of organisms based on the information contained in their DNA sequences is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has led to important computational challenges in the estimation of these phylogenetic relationships. Among these challenges is the estimation of the evolutionary history for a group of species based on DNA sequence information from several distinct genes sampled throughout the genome. This research is focused on the development of computationally efficient methods for estimating the evolutionary history when the number of species under consideration is very large (i.e., hundreds to thousands). This is accomplished by considering collections of three species at a time, and using properties of the estimated evolutionary history for groups of three to infer the overall evolutionary history. Properties and performance of the method will be evaluated theoretically as well as with both simulated and empirical data sets. This work has numerous practical applications, such as the study of the evolutionary relationships among human populations.
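
The idea of working with three species at a time can be illustrated with a minimal sketch. It is not the proposed method: the abstract does not specify how each group of three is resolved, so the sketch swaps in a simple distance rule (the closest pair in each triple is treated as the sister pair, which is valid only if distances are clock-like). The names `rooted_triple`, `all_rooted_triples`, and the toy distances are hypothetical, and assembling the triples into a full tree is not shown.

```python
from itertools import combinations


def rooted_triple(dist, a, b, c):
    """Resolve one group of three taxa as (sister_pair, outgroup).

    Assumption for illustration only: the pair with the smallest
    pairwise distance is taken to be the sister pair.
    """
    candidates = [((a, b), c), ((a, c), b), ((b, c), a)]
    return min(candidates, key=lambda pc: dist[frozenset(pc[0])])


def all_rooted_triples(taxa, dist):
    """Resolve every group of three taxa; there are C(n, 3) such groups."""
    return [rooted_triple(dist, a, b, c) for a, b, c in combinations(taxa, 3)]


# Toy distances consistent with the tree ((A,B),(C,D)).
taxa = ["A", "B", "C", "D"]
dist = {frozenset(pair): 6.0 for pair in combinations(taxa, 2)}
dist[frozenset(("A", "B"))] = 2.0
dist[frozenset(("C", "D"))] = 2.0

triples = all_rooted_triples(taxa, dist)
for pair, outgroup in triples:
    print(f"({pair[0]},{pair[1]}) | {outgroup}")
```

Each printed line gives one resolved triple, with the sister pair on the left and the remaining taxon on the right. The number of triples grows as n(n-1)(n-2)/6, so for hundreds to thousands of species the per-triple computation has to be cheap.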

Though the amount of genomic data available for inferring phylogenetic species trees has increased rapidly within the last 10 years, few methods have been developed to efficiently estimate species trees for data sets consisting of hundreds or thousands of species. A fast approximation to the maximum likelihood estimate (MLE) that retains desirable statistical properties, such as consistency and asymptotic efficiency, is proposed. Results from preliminary work suggest that this approach will be significantly faster than existing likelihood and Bayesian approaches, while also being highly accurate. The method can be applied to a range of data types, including allele frequency data arising under a Brownian motion model along the phylogeny and single nucleotide polymorphism (SNP) data arising from the coalescent model. A software package will be developed to implement the methodology. The project will also support one PhD student, who will contribute to the development and implementation of the methodology.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1832303
Program Officer: Gabor Szekely

Budget Start: 2017-07-01
Budget End: 2021-07-31
Fiscal Year: 2018
Total Cost: $138,594

Joan and Sanford I. Weill Medical College of Cornell University, New York, NY, United States