In this project, a team of investigators will develop new algorithms and software to simultaneously align DNA sequences and reconstruct phylogenetic trees. This methods- and theory-oriented project addresses an important problem in phylogenetic reconstruction: the relatively poor performance of existing tools in the face of insertions, deletions, and duplications in large datasets. The project will develop a simultaneous approach to DNA sequence alignment and phylogenetic analysis that will allow researchers to overcome these problems. Specific goals are to develop a portal and open-source software for simultaneous alignment and phylogenetic analysis, develop new simulators to model DNA sequence evolution, establish a working group on alignment methods with the Assembling the Tree of Life (AToL) community, and develop training programs in alignment and phylogeny estimation with outreach activities to minority institutions. The project includes many members of the Cyberinfrastructure for Phylogenetic Research (CIPRES) project and will provide significant new analytic capabilities for that data resource.

By making simultaneous alignment and phylogenetic analysis feasible for very large datasets, this project will provide software tools that will serve a broad community of researchers conducting phylogenetic analyses of DNA sequence data. These tools will enable phylogenetic analysis of DNA regions that cannot be aligned using existing tools. An open-source portal interface will open multiple sequence alignment and tree-building to a broader range of users, and engagement of existing AToL users will provide input and evaluation early in the software development process.

Project Report

Overview: The main goal of this project was to develop methods for simultaneous estimation of multiple sequence alignments and phylogenetic trees, with an emphasis on large datasets. In addition, because the estimation of each improves the estimation of the other, research that focused on improving methods for one problem but not the other was also desirable. An additional goal was the engagement of the biology research community through training in the use of the software developed by the project.

Intellectual Merit: The major contribution was the development of three methods for co-estimation of alignments and trees: SATe (Liu et al., Science 2009), SATe-II (Liu et al., Systematic Biology 2012), and PASTA (Mirarab et al., RECOMB 2014, and Journal of Computational Biology, in press). The first of these methods was able to compute highly accurate alignments and trees within a 24-hour period, and could analyze up to 10,000 sequences. The second was faster and more accurate, and could analyze up to 50,000 sequences. The third, PASTA, is faster and more accurate still, and can analyze up to one million (1,000,000) sequences. Other research contributions include UPP (ultra-large alignments using ensembles of Hidden Markov Models, to appear in RECOMB 2015) and several methods for species tree estimation in the presence of gene tree conflict due to incomplete lineage sorting. Methods developed by the project are now being used in many biology research papers, including recent publications in the Proceedings of the National Academy of Sciences (Wickett, Mirarab, et al., PNAS 2014) and Science (Jarvis, Mirarab, et al., Science 2014).

Broader Impact: The project held Symposia and Software Schools to train students, postdoctoral fellows, and scientists in the use of project software; these events trained 50-150 people each year. The project also ran a summer research training program with Huston-Tillotson University, an HBCU in Austin, Texas. Several PhD students were trained on the grant.

Agency: National Science Foundation (NSF)
Institute: Division of Environmental Biology (DEB)
Application #: 0733029
Program Officer: Simon Malcomber
Project Start:
Project End:
Budget Start: 2007-10-01
Budget End: 2014-09-30
Support Year:
Fiscal Year: 2007
Total Cost: $1,590,260
Indirect Cost:
Name: University of Texas Austin
Department:
Type:
DUNS #:
City: Austin
State: TX
Country: United States
Zip Code: 78712