Mathematical and statistical modeling of gene genealogies (trees that reflect ancestral relationships among sampled molecular sequences) is central to many biological fields, including population genetics, phylodynamics of infectious disease, paleogenomics, phylogenetics, and cancer genomics. Kingman's n-coalescent is a stochastic process of gene genealogies whose parameters depend on an evolutionary model. Inference of the model parameters then contributes to an understanding of the phenomena that gave rise to the sequences. Although many sophisticated methods have been developed to date, major statistical and computational challenges remain because the state space of genealogies grows superexponentially with the number of samples. We are no longer data-limited; instead, we lack computational and statistical methods for the analysis of large-scale emerging genomic data sets. The long-term goal of the researchers is to develop statistically consistent and computationally efficient coalescent methods for exact inference of evolutionary parameters from next-generation sequencing data sets. The objective of this research is to apply the notion of lumpability of Kingman's n-coalescent to address the state-space explosion problem of coalescent methods. The basic idea is to model a coarser resolution of the underlying genealogy and thereby reduce the cardinality of the hidden state space. These coarser coalescent models include Tajima's coalescent and the pure-death process coalescent.
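As a rough illustration of the state-space reduction described above, the sketch below counts tree topologies only (branch lengths are ignored) at three resolutions for n samples. It relies on standard combinatorial results rather than on anything specific to this project: ranked labeled genealogies, the resolution of Kingman's n-coalescent, number n!(n-1)!/2^(n-1); ranked unlabeled tree shapes, the resolution of Tajima's coalescent, are counted by the Euler zigzag number E_{n-1}. The pure-death process is assumed here to track only the number of ancestral lineages, which gives n states. The function names are hypothetical.

```python
from math import comb, factorial


def kingman_ranked_labeled_trees(n: int) -> int:
    """Ranked, labeled binary genealogies on n samples: n!(n-1)!/2^(n-1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def tajima_ranked_tree_shapes(n: int) -> int:
    """Ranked, unlabeled binary tree shapes on n leaves: Euler zigzag number E_{n-1}."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Euler zigzag numbers via 2*E_{m+1} = sum_k C(m, k) * E_k * E_{m-k}, E_0 = E_1 = 1.
    zigzag = [1, 1]
    for m in range(1, n - 1):
        total = sum(comb(m, k) * zigzag[k] * zigzag[m - k] for k in range(m + 1))
        zigzag.append(total // 2)
    return zigzag[n - 1]


def lineage_count_states(n: int) -> int:
    """States of a process that records only the lineage count (n, n-1, ..., 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(
            n,
            kingman_ranked_labeled_trees(n),
            tajima_ranked_tree_shapes(n),
            lineage_count_states(n),
        )
```

Under these assumptions, the ranked labeled count grows far faster than the ranked unlabeled count, which is the motivation for modeling coarser resolutions of the genealogy.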
The specific aims are to (1) prove theorems for coalescent models and provide theoretical and practical tools for addressing computational challenges when modeling different resolutions, or lumpings, of Kingman's coalescent; (2) develop scalable methods for inference of evolutionary parameters using different coalescent models; (3) theoretically and empirically validate the inference methods, applying them to simulations and to molecular sequences from infectious diseases such as Zika, as well as to ancient DNA samples from bison in North America and to ancient and modern human samples; and (4) implement the novel methods in open-source software, ensuring fast dissemination of the methodology among researchers. The research is innovative in several ways. First, Tajima's coalescent has not yet been exploited for inference, despite the potential offered by its smaller state space. Second, the methods developed here will allow inference from data sets that have not previously been analyzed because of computational limitations. Third, we will provide not only a suite of tools ready for application but also statistical results supporting our implementations. Our proposed research on scalable modeling of genealogical trees will be significant in a number of fields, including the theory of evolutionary trees, statistical inference in population genetics and phylogenetics, and the analysis of molecular sequences from infectious disease and ancient DNA.
Coalescent models are fundamental in many public-health-related fields, including phylodynamics of infectious diseases, cancer genomics, phylogenetics, paleogenomics, and population genetics. This project will develop a new class of improved coalescent methods applicable to large-scale genetic studies, with statistical results to support the methods.