Phylogenetic reconstruction is an invaluable tool for studying molecular sequences. Starting from a description of how the characters in the sequences mutate over time, the methods attempt to uncover the sequences' relatedness. Common applications range from describing the evolutionary histories of living organisms in evolutionary biology to estimating genetic distances and constructing protein families in molecular biology and bioinformatics. Standard reconstruction methods rely on sequence alignments that specify which characters in the sequences are homologous, that is, derived from a common ancestor. A fundamental difficulty is that sequence alignments are not directly observed; they are inferred properties of the raw sequence data and must be estimated along with the phylogeny. Current tools handle this inference sequentially, first determining a sometimes poor estimate of the alignment and then conditioning on that alignment being correct to reconstruct the phylogeny. This project provides practical tools for end-users to simultaneously infer alignment and phylogeny, side-stepping biases that sequential estimation introduces. The tools assume both a character substitution model and an insertion/deletion (indel) process through which characters are added or removed, generating an alignment. Further, these indels supply previously under-utilized information from the data to infer phylogenies. Major advances make this phylo-alignment framework useful for real-life datasets. The framework draws heavily on hidden Markov models, Bayesian computation and clever parameter integration to produce a computationally efficient inference engine. Expert prior knowledge helps inform the indel process. From this, realistic priors enable Bayes factor tests to address whether specific indels are shared by descent or are homoplastic, reducing controversy over their value in phylogenetics. Modeling assumptions better reflect the underlying biology.
Allowing spatial variation in the indel process provides more accurate phylogenies and alignments. The extensions also provide for heterogeneity tests to identify evolutionarily interesting sequence regions. Examples of the methods span all time-scales of evolution, from billions of years, to infer early branches in the Tree of Life, down to a matter of months, to describe the diversification of rapidly evolving viruses within infected hosts. This project markedly impacts many fields across biomedical research. For example, the project furnishes mathematical and statistical training in bioinformatics, which will play a prime role in discovery during the 21st century, and rigorous inference tools employing phylo-alignment deliver improved molecular, comparative studies, a more accurate understanding of human evolution and new perspectives from which to battle infectious diseases.