The present diversity of life has evolved from a single ancestor through billions of years of evolution. Understanding these evolutionary histories is fascinating, but more importantly, is a crucial precursor to many biological analyses. Some evolutionary relationships are obvious (e.g., a cat is closer to a lion than a chicken) but other consequential relationships are hard to discern. Luckily, evolution operates on the genomes of organisms, and the sequence of genetic changes leaves a trace of the evolutionary histories. Following these traces and reconstructing the evolutionary past, however, is a computational problem, and as it turns out, is a difficult problem. Sophisticated methods are needed to infer a phylogeny: a tree, called tree-of-life, that shows the historical relationships between species. When sequencing whole genomes became possible in the mid-2000s, many believed the sheer amount of data would result in robust reconstructions of phylogenies. While genome sequencing has fulfilled some of its promises, other challenges remain. Large-scale data are hard to adequately model and are hard to screen for errors. As a result, different analyses do not always agree, and also, inference algorithms are pushed to their limits of scalability. Thus, an improved understanding of the tree-of-life requires not just more data but also better algorithms. Interestingly, as data sciences permeate many areas of science, issues of robustness to error and scalability faced in phylogenetics will confront many disciplines. Thus, the next generation of data scientists needs to be trained to consider these concerns when developing algorithms for data analysis.

This project seeks to address current limitations in phylogenomics (phylogeny inference from whole genomes) and to integrate issues of robustness and scalability into teaching. The main challenge in phylogenomics is data heterogeneity, and there are two sources of data heterogeneity: real biological processes driving genome evolution that lead to discordant histories across the genome, and artefactual heterogeneity that results from complex pipelines used to prepare the data for inference. Models of real heterogeneity exist. However, current methods often require knowing the source of heterogeneity in advance, are often not scalable, are not always robust to artefactual heterogeneity. The approach taken here is to combine unsupervised learning and discrete optimization to build methods for identifying errors. These techniques will strive to minimize assumptions and will use both parametric and non-parametric statistics. The project will draw on machine learning, multi-criteria optimization, and high-performance computing. If successful, it will dramatically improve the accuracy and scalability of genome-wide phylogeny reconstruction and will help researchers understand intricate patterns in genome evolution. To integrate research and education, this project will enable yearly hackathons that bring together students with computational and biological expertise with the goal of developing robust and scalable methods. The project will also seek to improve the understanding of data science for undergrad and K-12 students, emphasizing for them both the excitement and challenges of analyzing large error-prone datasets. The tools developed here will be publicly available and well-documented. Yearly workshops will be held to help biologists learn and use the tools.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Application #
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of California San Diego
La Jolla
United States
Zip Code