Inferring the true evolutionary history of a group of organisms (taxa) is a difficult problem. The number of possible family trees for a given set of taxa grows at least exponentially with the number of taxa, so an exhaustive exploration of all possible trees is infeasible. As a result, the most popular techniques sample tree space to obtain an estimate of the true evolutionary tree. The challenge is to know when an estimate of an evolutionary tree for a group of taxa has converged, which is important because non-convergence leads to inaccurate estimation of the true evolutionary tree.
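To make the size of tree space concrete, the following is a minimal sketch in Python. It assumes fully resolved (binary), unrooted trees on labeled taxa, for which the number of distinct topologies is the double factorial (2n - 5)!!. The function name is illustrative and is not part of the project's software.

```python
def num_unrooted_binary_trees(n_taxa: int) -> int:
    """Count distinct unrooted binary tree topologies on n_taxa labeled taxa.

    Uses the standard result (2n - 5)!! = 3 * 5 * ... * (2n - 5), valid for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted binary tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 50):
        print(n, num_unrooted_binary_trees(n))
```

Ten taxa already give 2,027,025 possible topologies, and the count grows faster than any exponential in the number of taxa, which is why sampling rather than enumeration is used.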

The team will develop a suite of convergence detection algorithms for large-scale Markov chain Monte Carlo (MCMC) phylogenetic analyses, one of the most popular techniques for reconstructing large-scale evolutionary trees. The algorithms are intended to handle hundreds of thousands of trees on hundreds to thousands of taxa. Convergence detection changes the framework for how these evolutionary trees are reconstructed. For example, an analysis that has not yet converged, rather than being terminated based on some arbitrary specification (e.g., elapsed time), could be allowed to continue as long as progress toward convergence is detected. If no further progress is detected, the analysis would be terminated, saving significant time and computational resources. The approach also gives life scientists information about why their analysis did not converge.

The team will develop convergence detection techniques that are based on the topological structure (i.e., the evolutionary relationships contained in a tree) of the underlying phylogenetic tree instead of relying solely on its score. To address the above issues, the novel integrated framework consists of: (i) designing and analyzing new algorithms for convergence detection, (ii) identifying the causes of non-convergence in a phylogenetic analysis, (iii) performing real-time convergence analysis, and (iv) developing new visualization tools that provide informative views of convergence data.
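As an illustration of what a topology-based convergence check can look like, the sketch below compares how often each split (bipartition of the taxa) appears in the trees sampled by two independent runs. It is a simplified variant of the average standard deviation of split frequencies diagnostic reported by MrBayes, and it is not the algorithm proposed by this project. The data representation (each tree given as a set of frozensets of taxon names) and all function names are assumptions made for the example.

```python
from collections import Counter
from statistics import stdev


def split_frequencies(trees):
    """Return the fraction of sampled trees that contain each split.

    Assumed representation: each tree is a set of splits, and each split
    is a frozenset holding the taxon names on one side of an edge.
    """
    counts = Counter(split for tree in trees for split in tree)
    return {split: c / len(trees) for split, c in counts.items()}


def avg_split_frequency_deviation(run_a, run_b):
    """Average, over all observed splits, of the standard deviation of the
    split's frequency across the two runs. Values near 0 suggest the runs
    are sampling similar topologies."""
    freq_a = split_frequencies(run_a)
    freq_b = split_frequencies(run_b)
    all_splits = set(freq_a) | set(freq_b)
    if not all_splits:
        return 0.0
    deviations = [
        stdev([freq_a.get(s, 0.0), freq_b.get(s, 0.0)]) for s in all_splits
    ]
    return sum(deviations) / len(deviations)


if __name__ == "__main__":
    ab = frozenset({"A", "B"})  # split grouping taxa A and B
    ac = frozenset({"A", "C"})  # conflicting split grouping A and C
    run_1 = [{ab}, {ab}, {ab}, {ac}]
    run_2 = [{ab}, {ab}, {ac}, {ac}]
    print(avg_split_frequency_deviation(run_1, run_2))
```

A score-only diagnostic could report agreement between two runs whose likelihoods are similar even when they favor different relationships; comparing split frequencies exposes that disagreement directly.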

The collaboration between a research university and an undergraduate liberal arts college offers many benefits. Undergraduate and graduate students in both biology and computer science have an opportunity to design and implement algorithms and to run computational experiments on large data sets that would otherwise be unavailable to them. The large trees that can be considered have applications in improving global agriculture and protecting ecosystems from invasive species. The results of this work will be presented at scientific conferences and workshops and published in journals. Tools and software developed will be made publicly available.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1018785
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $397,003
Indirect Cost:
Name: Texas A&M Engineering Experiment Station
Department:
Type:
DUNS #:
City: College Station
State: TX
Country: United States
Zip Code: 77845