Phylogenomics is a relatively new field that seeks to understand evolutionary relationships between organisms at the scale of the whole genome. A central goal of evolutionary biology is to understand the relationships between organisms, which are usually summarized in the form of a phylogenetic tree. The methods in common use for building these trees tend to work best when the organisms are closely related and the sequences are relatively short, for example, the DNA sequence of a single gene compared across a collection of mammals. When comparing more distantly related organisms, or data from large portions of the genome, current techniques can break down. Since modern technology can quickly and cheaply produce genome-scale sequence data, there is a pressing need for better analytical tools tailored to such large-scale, high-dimensional data. The most popular statistical methods for finding general patterns in large-scale data, such as Principal Component Analysis (PCA), assume that the space in which the data lie is flat, like the plane geometry of Euclid. However, the space of possible phylogenetic trees has a decidedly non-Euclidean geometry, with a surface more akin to an origami figure made with a sheet of rubber. The goal of this project is to develop alternative types of principal components, and methods to calculate them, which take into account the unusual structural features of the mathematical space of phylogenetic trees.
PCA is a statistical method that projects data points in a high-dimensional Euclidean space onto a lower-dimensional plane chosen to minimize the sum of squared distances between each point in the data set and its orthogonal projection onto the plane. It has been used to cluster high-dimensional data points for statistical analysis, and it is one of the simplest and most robust methods of dimensionality reduction in a Euclidean vector space. However, it relies on the properties of a Euclidean vector space. The space of all possible phylogenies on a fixed set of species does not form a Euclidean vector space, so PCA must be reformulated in the geometry of tree space. Motivated by T. Nye's 2011 work on the construction of the first principal component, or principal geodesic, the PIs propose two geometric objects, each under a different metric, that represent a k-th order principal component: (1) the locus of the weighted Fréchet mean of k+1 points in tree space under the Billera-Holmes-Vogtmann (BHV) metric, where the weights vary over the associated probability simplex, and (2) the tropical convex hull of k+1 points in tree space under the tropical metric, which comes from tropical geometry over the max-plus algebra. The first aim of this project is to prove properties of PCA under the BHV metric and of PCA under the tropical metric over tree spaces. The second aim is to develop efficient algorithms to compute or approximate these principal components. Simulation studies will be conducted to show that these algorithms perform well. The PIs will then apply these algorithms to empirical data sets, such as Apicomplexa (a phylum of parasitic alveolates that includes the parasites that cause malaria), African coelacanth genomes, and hemagglutinin sequences of influenza from New York. The broader impact will include advising undergraduate students on the implementation of the algorithms and of the user interfaces for the software products.
These research experiences will complement a new Data Science program being developed as a component of the current Hawaii EPSCoR program. A portion of the summer effort will also be used to collaborate with nearby high school science and engineering programs in the development of data analysis lesson modules.
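To make the second proposed object, the tropical convex hull, more concrete, the following is a minimal Python sketch. It is an illustration only and not the PIs' software. It assumes that each tree is encoded as a vector of real numbers (for example, pairwise distances between leaves), considered up to adding the same constant to every coordinate. The function names tropical_distance and tropical_project are hypothetical, and the distance and max-plus projection formulas are the ones commonly given in the tropical convexity literature; neither formula is stated in this abstract.

```python
from typing import List, Sequence

Vector = Sequence[float]


def tropical_distance(u: Vector, v: Vector) -> float:
    """Tropical metric: max_i(u_i - v_i) - min_i(u_i - v_i).

    The value is unchanged if a constant is added to every
    coordinate of u or of v.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diffs = [a - b for a, b in zip(u, v)]
    return max(diffs) - min(diffs)


def tropical_project(x: Vector, vertices: Sequence[Vector]) -> List[float]:
    """Project x onto the max-plus tropical convex hull of `vertices`.

    Assumption: uses the commonly cited formula in which each vertex v
    is shifted by the largest constant that keeps it coordinate-wise
    below x, and the shifted vertices are combined with a
    coordinate-wise maximum.
    """
    # Largest shift s such that s + v <= x in every coordinate.
    shifts = [min(xj - vj for xj, vj in zip(x, v)) for v in vertices]
    # Max-plus combination of the shifted vertices.
    return [
        max(s + v[j] for s, v in zip(shifts, vertices))
        for j in range(len(x))
    ]


if __name__ == "__main__":
    # Hypothetical toy data: two "trees" spanning a tropical line segment.
    vertices = [(0.0, 0.0, 0.0), (0.0, 3.0, 1.0)]
    x = (0.0, 1.0, 2.0)
    p = tropical_project(x, vertices)
    print(p, tropical_distance(x, p))  # [0.0, 1.0, 0.0] 2.0
```

In this toy example the two vertices play the role of the k+1 points with k = 1, so their tropical convex hull is a tropical line segment. A fitting procedure along the lines described above would search for vertices that keep the distances between the data points and their projections small; that search is not attempted here.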
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.