The University of Florida is awarded a grant to implement and evaluate, on the Microsoft Azure cloud computing platform, the latest methods for identifying the evolutionary history of gene duplications and losses, and will use these methods to reconstruct the history of whole genome duplications in plants. One of the greatest challenges in evolutionary biology is to identify the genetic mechanisms responsible for adaptive changes and species diversification. The availability of large-scale genomic data sets from many diverse species provides unprecedented opportunities to identify such important genetic changes. Gene duplication plays a key role in the gain of new gene functions and, consequently, in adaptive innovations. However, in order to link gene duplications with adaptive changes, it is necessary to determine when in evolutionary history the duplications took place. Recently developed model-based methods enable scientists to map the locations of gene duplication and loss events within a species phylogeny. However, these methods are computationally intensive and, consequently, have been implemented only for small data sets. Cloud computing through the Microsoft Azure platform offers the ideal system in which to extend the implementations of these methods to incorporate full genomic data sets from many organisms and to keep pace with the rapid accumulation of new genome sequences.
Education and training in computational biology are a major component of this project. Not only will this work motivate new research into modeling gene evolution and enable very large analyses to identify potential genomic innovations, it will also provide unique opportunities for cross-disciplinary training for a post-doc and a graduate student. Furthermore, educational resources on the uses of cloud computing for large-scale bioinformatics analyses will be developed for the classroom and the internet, and a workshop on cloud computing for evolutionary analyses will be held in conjunction with a conference of evolutionary biologists.
Intellectual Merit: Constructing evolutionary relationships, and ultimately the tree of life representing the relationships of all species, is one of the major goals of evolutionary biology, and it will enable insights into many aspects of biology, including the genetic mechanisms underlying important traits. The availability of genome sequences from a wide range of species provides a wealth of data to resolve relationships among species and to identify important changes in genes. However, these genomic data also present many computational challenges. For example, the evolutionary history of a gene (a gene tree) often conflicts with the evolutionary history of the species in which the gene evolves (a species tree). Such discordance arises naturally as a result of biological processes such as gene duplication and loss, hybridization, lateral gene transfer, incomplete lineage sorting, and recombination. Error in the phylogenetic reconstruction of gene trees also contributes to high levels of incongruence among gene trees. Thus, reconciling gene tree topologies with species trees presents an enormous computational challenge that is critical both for inferring phylogenetic relationships and for understanding the patterns of gene evolution within species.
This project developed new computational approaches to infer the evolutionary histories of species and the genes evolving within them. These approaches include methods to identify unsupported parts of phylogenetic trees, which indicate where more data or further analysis is needed, and methods to identify the relationships with the strongest support from collections of species trees or gene trees. These analyses can improve the accuracy of evolutionary analyses and assess the confidence in the resulting trees. We also developed new methods to infer large trees from collections of gene trees with discordant evolutionary histories and to merge phylogenetic trees with overlapping taxon sets to build extremely large species trees. Since these analyses may use extremely large genomic data sets, they require fast algorithmic approaches and often parallel computing, including resources such as the Microsoft Azure cloud computing platform. We demonstrated the ability of the new methods to reconstruct the backbone relationships among flowering plants from genomic data, and they also appear to help infer general patterns of gene duplication and loss in plants.
Broader Impacts: This project generated new methods and freely available software for phylogenetic inference and gene tree – species tree reconciliation. This includes software to identify unsupported parts of large phylogenetic trees based on new tree stability measures, to identify the strongly supported relationships in large collections of gene trees and species trees, and to construct species trees from gene trees with conflicting phylogenetic signals. We organized and hosted a workshop at the 2012 ACM meetings in Bioinformatics and Computational Biology to present some results from the project and link our research with work from other researchers worldwide. We also provided cross-disciplinary training for three post-docs and seven graduate students.