Because DNA sequence data have become increasingly easy to obtain, much of current phylogenetics estimates evolutionary relationships from representative DNA sequences for a set of sampled organisms. The sequence data available for a phylogenetic analysis often include samples taken from multiple genes within each organism, so the evolutionary process must be modeled at two distinct scales. First, given an overall phylogeny representing the actual evolutionary history of the species (the species tree), individual genes evolve their own histories, called gene trees. Then, along each gene tree, sequence data evolve, producing the observed data that are used for inference. The coalescent model provides the link between the evolution of the gene trees given the species tree and the evolution of the sequence data given the gene trees. Phylogenetic invariants have been proposed as a tool for inferring phylogenies from single-gene data, but they have not been studied in the multi-gene coalescent setting. The investigators use phylogenetic invariants to study the coalescent model for species tree inference, addressing questions such as the identifiability of the tree and its associated model parameters. In addition, they develop and implement methods that use phylogenetic invariants to estimate species trees from empirical DNA sequence data.

Inferring the evolutionary history of a collection of organisms from the information contained in their DNA sequences is a problem of fundamental importance in evolutionary biology. The abundance of DNA sequence data arising from genome sequencing projects has created significant challenges for the inference of these phylogenetic relationships. Among these challenges is inferring the evolutionary history of a collection of species from DNA sequence information for several distinct genes sampled throughout the genome. The two primary goals of this project are (1) to determine which aspects of the true phylogenetic history can be accurately identified given the information available in typical DNA sequence data sets, and (2) to develop methods for extracting the available information from the DNA sequences in order to estimate the true evolutionary relationships accurately and efficiently. Both objectives are approached using methods from algebraic statistics.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1106706
Program Officer: Gabor J. Szekely
Budget Start: 2011-10-01
Budget End: 2014-09-30
Fiscal Year: 2011
Total Cost: $180,000

Name: Ohio State University
City: Columbus
State: OH
Country: United States
Zip Code: 43210