Many important problems in biological and biomedical research require a robust means of phylogenetic tree inference: models of viral transmission, gene function inference, and assessment of genetic diversity in the human microbiome, to name a few. These applications also depend on a rigorous means of assessing tree inference uncertainty; the Bayesian framework provides a principled means of assessing and integrating out this uncertainty. However, the currently available Bayesian algorithmic tools are not capable of performing inference on large modern data sets, which may also change continually as new sequencing results become available. In particular, state-of-the-art methods are almost exclusively based on random-walk Markov chain Monte Carlo (MCMC) using uniformly selected local moves, even though most of these local moves will substantially worsen even a mediocre tree. Convergence problems with this approach are well documented, and thus current methods are limited to around 1000 sequences, a number much smaller than the size of microbial and immune data sets relevant to modern biomedicine. In addition, all current methods require inference to be started from scratch each time the sequence data change. The broader impacts of this work will extend in three directions: enabling novel applications of Bayesian phylogenetics, stimulating new areas of computer science research, and attracting new talent to the field.

Applications of phylogenetics, in particular Bayesian phylogenetics, are being significantly held back by computational limitations. High-throughput sequencing technologies can return millions of sequences for studies of the human microbiome, viruses, oceanic microbes, and antibody-making B cells, but these cannot be handled with current methods. The models also need to be more realistic, without assumptions of independent interactions. Understanding the shape of multidimensional phylogenetic likelihood surfaces in detail might help to improve inference of the tree topology. The teams will also investigate when an optimal tree on a taxon set contains the optimal tree on a taxon subset. These investigations will help to expand the approach to phylogenetic inference. The resulting algorithmic insights will be incorporated into publicly available inference packages, with the goal of providing inference on an order of magnitude more taxa than is currently possible.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 2110182
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2020-10-01
Budget End: 2021-06-30
Support Year:
Fiscal Year: 2021
Total Cost: $122,180
Indirect Cost:
Name: Fred Hutchinson Cancer Research Center
Department:
Type:
DUNS #:
City: Seattle
State: WA
Country: United States
Zip Code: 98109