In this project, a team of investigators will develop new algorithms and software to simultaneously align DNA sequences and reconstruct phylogenetic trees. This methods- and theory-oriented project addresses an important problem in phylogenetic reconstruction: the relatively poor performance of existing tools in the face of insertions, deletions, and duplications in large datasets. This project will develop a simultaneous approach to DNA sequence alignment and phylogenetic analysis that will allow researchers to overcome these problems. Specific goals for the project will be to develop a portal and open-source software for simultaneous alignment and phylogenetic analysis, develop new simulators to model DNA sequence evolution, establish a working group on alignment methods with the Assembling the Tree of Life (AToL) community, and develop training programs in alignment and phylogeny estimation with outreach activities to minority institutions. The project includes many members of the Cyberinfrastructure for Phylogenetic Research (CIPRES) project and will provide significant new analytic capabilities for that data resource.

By making simultaneous alignment and phylogenetic analysis feasible for very large datasets, this project will provide software tools that will serve a broad community of researchers conducting phylogenetic analyses of DNA sequence data. These tools will enable consideration of DNA regions for phylogenetic analysis that cannot be aligned using existing tools. An open-source portal interface will open multiple sequence alignment and tree-building to a broader range of users, and engagement of existing AToL users will provide input and evaluation early in the software development process.

Project Report

The overarching goal of this collaborative research project is to develop methods for simultaneous multiple sequence alignment (MSA) and tree inference that can scale to thousands of taxa and produce better alignments and trees than current methods for Tree of Life scale data. The main objective of our group at the University of Nebraska-Lincoln was to develop tools that the other groups in this collaborative project can use to test the phylogenetic methods they develop. Specifically, we developed a sequence simulation program (indel-Seq-Gen, or iSG), large-scale benchmark datasets of simulated protein and DNA sequences, and SuiteMSA, a set of visual tools for MSA comparison.

iSG is capable of simulating biologically realistic sequence evolution at both the DNA (coding and non-coding) and protein levels. It can simulate highly divergent evolution, incorporating insertion/deletion events, discrete evolutionary steps, lineage-specific as well as site-specific parameterizations, motif conservation, subsequence length constraints, and event tracking. iSG is freely available from: http://bioinfolab.unl.edu/~cstrope/iSG/

Using iSGv2.1, large-scale benchmark alignment datasets have been produced for both DNA and protein sequences. The DNA alignment datasets include 20 replicates each for 125 distinct model conditions (gap lengths, indel occurrence probability, evolutionary rates). The protein alignment datasets include 30 replicates each for 135 distinct model conditions (indel occurrence probabilities, evolutionary rates, and shape parameters of the gamma rate distribution). For each model, sequences were generated for 5000 and 10000 taxa. Based on the true alignments and true phylogenies, useful statistics (e.g., alignment properties, error rates) were also collected. All benchmark alignment datasets are publicly available from: http://bioinfolab.unl.edu/~cstrope/iSG/benchmark/index.html

While numerous MSA reconstruction methods have been developed, users often simply run one MSA method and proceed directly to the next analysis without examining the alignment output. Given the importance of MSAs, quality assessment of MSA methods should be easier and more intuitive to perform. We developed SuiteMSA for this reason. In addition to setting up and running iSGv2.1 simulations and displaying simulation results graphically, it provides several types of MSA assessment tools. MSA Viewer can be used to color-code an MSA based on various residue properties and on secondary structure/transmembrane predictions. MSA Comparator is a pairwise MSA comparison tool that calculates various alignment comparison statistics, including sum-of-pairs, column, and Shift scores. Pixel Plot is a multiple MSA comparison tool that can be used to compare large-scale MSAs based on gap distribution patterns. Secondary structure and transmembrane prediction based color-coding is also available for easy visual assessment. An iSG simulation GUI, a phylogeny viewer/editor, and an MSA reconstruction GUI for MUSCLE and ClustalW2 are also included. SuiteMSA is a Java-based program compatible with Macintosh OS X, Linux, and Windows. It is freely available from our website: http://bioinfolab.unl.edu/~canderson/SuiteMSA/

Two PhD students and five undergraduate students have been involved in various parts of this project. One of the PhD students was also supported as a postdoc after his graduation. As part of the larger collaborative research team, we organized three symposia/workshops on large-scale phylogenetics and phylogenomics. At each workshop, we gave demonstrations and tutorials for SuiteMSA and iSG.
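
To make the sum-of-pairs and column scores mentioned for MSA Comparator more concrete, the following is a minimal Python sketch of how such scores are commonly defined when a reconstructed alignment is compared against a reference (for example, a true alignment from a simulation). It is not SuiteMSA's implementation: the function names, the gap character, and the toy alignments are assumptions made for illustration, and the Shift score is not covered.

```python
from itertools import combinations


def column_signatures(alignment, gap="-"):
    """Describe each column by the residue index it holds in every sequence.

    A gap is recorded as None. Two alignments of the same sequences share a
    column exactly when the signatures are equal.
    """
    counters = [0] * len(alignment)
    columns = []
    for c in range(len(alignment[0])):
        signature = []
        for s, row in enumerate(alignment):
            if row[c] == gap:
                signature.append(None)
            else:
                signature.append(counters[s])
                counters[s] += 1
        columns.append(tuple(signature))
    return columns


def aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for signature in columns:
        present = [(s, i) for s, i in enumerate(signature) if i is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(reference, test):
    """Fraction of reference residue pairs that are also aligned in test."""
    ref_pairs = aligned_pairs(column_signatures(reference))
    test_pairs = aligned_pairs(column_signatures(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def column_score(reference, test):
    """Fraction of reference columns reproduced exactly in test."""
    ref_columns = column_signatures(reference)
    test_columns = set(column_signatures(test))
    matched = sum(1 for col in ref_columns if col in test_columns)
    return matched / len(ref_columns) if ref_columns else 0.0


if __name__ == "__main__":
    # Hypothetical toy alignments of the same three sequences.
    reference = ["AC-GT", "ACAGT", "A--GT"]
    test = ["ACG-T", "ACAGT", "A-G-T"]
    print("SP score:", sp_score(reference, test))
    print("Column score:", column_score(reference, test))
```

Both scores range from 0 to 1, and an alignment compared with itself scores 1 on each. The column score is the stricter of the two, since a single misplaced residue invalidates the whole column, while the sum-of-pairs score still credits the pairs in that column that remain correctly aligned.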

Agency: National Science Foundation (NSF)
Institute: Division of Environmental Biology (DEB)
Application #: 0732863
Program Officer: Simon Malcomber
Project Start:
Project End:
Budget Start: 2007-10-01
Budget End: 2013-09-30
Support Year:
Fiscal Year: 2007
Total Cost: $266,830
Indirect Cost:
Name: University of Nebraska-Lincoln
Department:
Type:
DUNS #:
City: Lincoln
State: NE
Country: United States
Zip Code: 68588