Inferring the true evolutionary history for a group of organisms, taxa, is a difficult problem. For a given set of taxa, there is an exponential number of ways to depict their family tree. Hence, an exhaustive exploration of all possible trees is infeasible. As a result, the most popular techniques sample tree space in order to obtain an estimate of the true evolutionary tree. The challenge is to know when a an estimate of an evolutionary tree for a group of taxa has converged, which is important because non-convergence leads to inaccurate estimation of the true evolutionary tree. The team will develop a suite of convergence detection algorithms for large-scale Markov Chain Monte Carlo phylogenetic analyses, one of the most popular techniques for reconstructing large-scale evolutionary trees that can handle hundreds of thousands of trees on hundreds to thousands of taxa. Convergence detection changes the framework for how these evolutionary trees are reconstructed. For example, analyses that have not yet converged, rather than be terminated based on some arbitrary specification (e.g., elapsed time), could be allowed to continue as long as progress toward convergence is detected. If progress is still not made, the phylogenetic analysis would be terminated saving significant time and computational resources. The approach arms life scientists with information for why their analysis did not converge.

The team will develop convergence detection techniques that are based on the topological structure (i.e., the evolutionary relationships contained in a tree) of the underlying phylogenetic tree instead of relying solely on its score. To address the above issues, the novel integrated framework consists of: (i) designing and analyzing new algorithms for convergence detection, (ii) identifying the causes for non-convergence in a phylogenetic analysis, (iii) performing real-time convergence analysis, and (iv) developing new visualization tools that provide informative views of convergence data.

There are many benefits that exist between the collaboration of a research university and an undergraduate liberal arts college. Both undergraduate and graduate students in both biology and computer science have an opportunity to design and implement algorithms and run computational experiments on large data sets that would otherwise be unavailable to them. The large trees that can be considered have applications in improving global agriculture and protecting ecosystems from invasive species. The results of this work will be presented and disseminated at scientific conferences, workshops, and journals. Tools and software developed will be made publicly available.

Project Report

This grant was a collaborative effort over three years between two Principle Investigators (PIs) from Vassar College and Texas A&M University. My undergraduate students had the opportunity to contribute to the goals of this grant while working with my co-PI’s (Tiffani Williams's) Ph.D. students, which created a rewarding mentoring experience for both the graduate and undergraduate students. My students had opportunities to present the results of their work at Vassar’s annual Undergraduate Research Summer Institute Symposium, as well as at a national computer science education conference (SIGCSE), where two of my students competed in the ACM Student Research Competition, and one of my students won first place in the undergraduate division. The major research objectives of this project included: Designing and analyzing new algorithms for convergence detection; Identifying the causes for non-convergence in a phylogenetic analysis; Incorporating our techniques in order to perform real-time convergence analysis; and Developing new interactive and immersive visualization tools that provide informative views of convergence data. We succeeded in making progress on all four objectives, thanks to my students being able to contribute to the development of two tools for phylogenetic analysis being developed by my co-PI’s Ph.D. students. One of those tools was a customized version of MCMCTree, which is part of the PAML suite of tools. This tool was developed before evolutionary trees contained as many taxa (organisms) as they do today. The original tool, based on Markov Chain Monte Carlo techniques, uses both fossil data and a phylogenetic tree, and estimates dates when taxa diverged from common ancestors. With large modern trees (like the set of all known mammals), the original MCMCTree would have run for ten months to compute a solution, if it didn’t run out of memory! The refactored version of MCMCTree (by Dr. Williams’s Ph.D. student) reduced the memory requirements and computation time to just one month. My student was able to further enhance MCMCTree’s capabilities by introducing checkpoint-restart functionality. A long-running program has the challenge of completing before a system crash occurs, and if they don’t, the program must start all over from the beginning. With checkpoint-restart capability, a program backs up periodically during its computation, and if the system crashes, it can resume its computation from the last checkpoint backup. These enhancements enable MCMCTree to better handle today’s larger trees and fossil data. The other tool my students worked on was TreeHouse, which was designed and implemented from the collective work of three of my co-PI’s Ph.D. students. TreeHouse is an interactive tool for storing, analyzing, manipulating, and visualizing large datasets of evolutionary trees. My students contributed heavily toward the visualization capabilities of TreeHouse. This visualization refers both to individual trees and collections of trees. Phylogenetic data is one of many examples today of Big Data, and visualization is one of the big challenges of big data. The ability to visualize large evolutionary trees (trees with hundreds or thousands of taxa) makes analyzing individual trees, or zooming in to a particular part of a tree, possible. The reason we have these large collections of trees today is because that’s what modern phylogenetic search tools return after searching tree space for trees containing a particular set of taxa. So not only are individual trees large, but collections of these trees are even larger. In order to determine whether the many trees in a collection have converged (are all similar in shape to each other), TreeHouse needed additional distance measures to compare trees, and clustering algorithms which use these distance measures to determine how similar all the trees in a collection are to each other. My students implemented two new distance measures, conflicting quartet distance, and Hamming distance, to augment TreeHouse’s existing Robinson-Foulds (RF) distance measure. Each measure reflects different relationships and bases for comparison among trees, which is good, because there is no single "correct" distance measure. My students also implemented three clustering algorithms to determine whether the trees in a dataset all clustered together (in one location in tree space), or whether there were two or more clusters of similar trees—or whether there were no clusters at all, indicating the search diverged (failed). The three clustering algorithms we implemented were Agglomerative, K-Means, and DBSCAN clustering, along with the ability to render the resulting clusters visually. Each of the three methods clusters trees based on a distance or similarity measure, but all achieve different clustering results, which is also good, since there is no one "correct" way to detect clusters of trees. The choice of distance measures and clustering algorithms, and the means to visualize different ways a tree dataset clusters, provide life scientists with powerful tools for analyzing and detecting whether their phylogenetic searches converged.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1018311
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2010-09-01
Budget End
2013-08-31
Support Year
Fiscal Year
2010
Total Cost
$102,997
Indirect Cost
Name
Vassar College
Department
Type
DUNS #
City
Poughkeepsie
State
NY
Country
United States
Zip Code
12604