Continuing advances in nucleotide sequencing have resulted in the assembly of datasets containing large numbers of species, genes, and genomic segments. Phylogenomic analyses of these data are essential to progress in understanding evolutionary patterns across the tree of life, and are finding increasing numbers of applications in practical analyses that require understanding of how patterns change over time. The sheer size of phylogenomic datasets limits the practical utility of available methods due to excessive time and memory requirements. We have developed many high impact methods and tools for comparative analysis of molecular sequences, a tradition we propose to continue through this MIRA project by developing innovative methods that address new challenges in phylogenomics. We will focus on pattern-based approaches of machine learning with sparsity constraint (SL) applied to phylogenomics, as a complement to traditional model-based methods in molecular evolution and phylogenetics. In the proposed SL in Phylogenomics (SLiP) framework, we will build models that best explain the biological trait or evolutionary hypothesis of interest, with genomic loci, such as genes, proteins, and genomic segments, serving as model parameters. Preliminary results from two example applications establish the premise and promise of a general SLiP framework. In one, SLiP successfully detected loci whose inclusion in a phylogenomic dataset overtakes a consistent and contrasting signal from hundreds of other loci when inferring phylogenetic relationships. In the other example, SLiP revealed loci and biological functional categories that harbor convergent sequence evolutionary patterns associated with the emergence of the same trait in distinct evolutionary lineages. In all of these analyses, SLiP required only a small fraction of the computational time and memory demanded by traditional methods, and it enabled better evolutionary contrasts with fewer assumptions. Consequently, the successful development of SLiP will improve the feasibility, rigor, and reproducibility of large-scale data analysis. It will also democratize big data analytics via shortened analysis time and a relatively small memory footprint, and encourage the development of a new class of methods for phylogenomic analysis. This framework will be accessed from a free library of SLiP functions, which will be directly useable via command line and available in a graphical interface through integration with the MEGA software.
The long-term goal of my research program is to develop methods and tools for comparative analysis of molecular sequences. In this project, we will develop a new class of phylogenomic methods based on sparse machine learning and benchmark their absolute and relative performance. New techniques and their software implementation will greatly facilitate data analyses that are vital for evolutionary and functional genomics.