An award is made to Harvard University and Ohio State University to develop statistical tools that can be used to build genealogical trees (phylogenies) of species using data sets consisting of multiple DNA sequences. Molecular data, such as DNA sequences, are frequently used to study the phylogenetic relationships of species, but current models for analyzing such data are not adequate and make a number of unjustified assumptions. In particular, existing methods tend to conflate the so-called "gene tree" of each DNA sequence, which can vary from one gene to the next, with the "species tree," which is usually the parameter of interest but can often differ from the gene tree(s). This project will develop methods that treat the gene tree and species tree as separate entities to be estimated from DNA sequence data.

Phylogenetic trees are fundamental tools for biologists studying a wide array of phenomena, including the sources of disease outbreaks, cancer genetics, the evolution of humans, and biodiversity. They are an important product of the Human Genome Project, and the many ongoing and completed genome projects will all produce data sets consisting of multiple genes. This project will allow biologists to make more efficient use of such data sets and to develop more accurate phylogenetic trees.

Project Report

This project focused on new methods for inferring the evolutionary relationships of species using DNA data. In the last few decades, scientists who study the history of life (evolutionary biologists and systematists) have frequently turned to the similarities and differences in DNA sequences between species to infer that history. That history is usually displayed in the form of a phylogeny, a branching diagram showing the ancestor-descendant relationships and branching pattern of a group of species as it evolved from a single common ancestor. Frequently, the DNA sequences used for such studies come from multiple different genes in the genome. This last fact, that multiple genes are used to reconstruct the history of life, has consequences for how we analyze those data to accurately infer phylogenetic history. Chance events at the population level, such as genetic drift, mean that the genealogies of individual genes may differ slightly from one another, even if they come from a species history that is singular and unique. As a result, statistical methods that allow for such ‘gene tree heterogeneity’ are necessary to account for this stochasticity.

Models incorporating gene tree heterogeneity via a paradigm called the multispecies coalescent model were the subject of this project. Several so-called ‘coalescent’ or ‘species tree’ methods of phylogenetic inference were developed as part of this project, including BEST, STAR, STEAC, Maximum Tree, and MP-EST. These methods use a variety of information from the collected DNA sequences or gene trees from multiple species. Some of the methods, such as BEST, use DNA sequence data and model both the tree of genes and the tree of species simultaneously. Other methods, such as STAR and MP-EST, use gene trees as input data.

Importantly, all of these methods have been tested with simulations to ensure that they are statistically consistent, meaning that, as the researcher collects more and more data, the methods converge to the correct answer. In particular, the methods have been tested using simulations in a region of parameter space called the anomaly zone, where gene tree heterogeneity is exceptionally high. Finally, uncertainty in the estimates of the gene trees and the species tree can be accommodated, either explicitly or via the bootstrap or other approaches.

Inferring phylogenetic history is an important tool that biologists have for understanding the history of life, and such inference bears on many issues of practical importance, such as the emergence of infectious disease, the relationships of humans to other primates, the antiquity of species, and many other issues. The methods developed as part of this grant will help evolutionary biologists, medical scientists, conservation scientists, and many other kinds of scientists use DNA sequence data in more productive and accurate ways.
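
To make the idea of gene tree heterogeneity concrete, the following is a minimal illustrative sketch, not one of the methods developed in this project (it is not BEST, STAR, STEAC, Maximum Tree, or MP-EST). It assumes the simplest multispecies coalescent setting: three species with species tree ((A,B),C), one sampled lineage per species, and an internal branch of length `t` in coalescent units. Under those assumptions, the probability that a gene tree matches the species tree is 1 - (2/3)e^(-t). The function names and parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def concordance_probability(t):
    """Analytic probability that a three-taxon gene tree matches the
    species tree ((A,B),C), given internal branch length t (coalescent units)."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def simulate_concordance(t, n_genes, seed=0):
    """Estimate the same probability by simulating n_genes independent genes."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_genes):
        # Waiting time for the A and B lineages to coalesce in their
        # ancestral population (two lineages coalesce at rate 1).
        if rng.expovariate(1.0) < t:
            matches += 1  # coalesced before reaching the root: gene tree matches
        elif rng.randrange(3) == 0:
            # All three lineages reach the root population; each of the
            # three pairs is equally likely to coalesce first.
            matches += 1
    return matches / n_genes


if __name__ == "__main__":
    for t in (0.1, 0.5, 1.0, 2.0):
        analytic = concordance_probability(t)
        simulated = simulate_concordance(t, 100_000)
        print(f"t={t}: analytic={analytic:.3f}, simulated={simulated:.3f}")
```

With a short internal branch, a large share of gene trees disagree with the species tree in this sketch, which is why methods that treat every gene tree as the species tree can be misled and why the project modeled the two separately.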

Agency: National Science Foundation (NSF)
Institute: Division of Environmental Biology (DEB)
Application #: 0743616
Program Officer: Maureen M. Kearney
Project Start:
Project End:
Budget Start: 2008-05-15
Budget End: 2014-04-30
Support Year:
Fiscal Year: 2007
Total Cost: $350,000
Indirect Cost:
Name: Harvard University
Department:
Type:
DUNS #:
City: Cambridge
State: MA
Country: United States
Zip Code: 02138