Phylogenetic analysis is a key tool in multiple areas, including disease monitoring and drug design; its goal is to infer evolutionary relationships among multiple species, as well as to provide insights into the mechanisms driving the process of molecular evolution. This proposal is informed by two recent trends in phylogenetic analysis. On one hand, most current approaches for phylogenetic analysis require sequence alignments as input and produce reliable results only for proteins with at least a moderate degree of sequence similarity. On the other hand, the scientific community has started to realize that standard procedures for phylogenetic analysis, which first construct a sequence alignment and then use this single point estimate to guide the construction of the phylogenetic tree, can introduce serious biases and make researchers overconfident about the inferred evolutionary history. Indeed, alignment and tree construction are two interrelated problems that should be tackled jointly rather than sequentially. The proposed work represents the first attempt to include structural protein alignments in phylogenetic analysis while jointly accounting for uncertainty in both alignment and tree construction. Our approach employs Markov chain Monte Carlo algorithms to generate samples from the posterior distribution of alignments and trees given the sequences and structures, providing a straightforward procedure to compute probabilities of hypotheses of interest.
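As a rough illustration of how posterior samples turn into probabilities of hypotheses, the following minimal sketch runs a Metropolis sampler over the three rooted topologies for three taxa and estimates the probability that two of them are sister taxa. It is not the method proposed here: the actual approach samples alignments and trees jointly under a model of sequence and structure, whereas this toy example assumes made-up unnormalized posterior weights (`WEIGHTS`) and hypothetical function names chosen only for illustration.

```python
import random

# Hypothetical unnormalized posterior weights for the three rooted
# topologies on taxa A, B, C (made-up numbers, for illustration only).
WEIGHTS = {"((A,B),C)": 6.0, "((A,C),B)": 3.0, "((B,C),A)": 1.0}
TOPOLOGIES = list(WEIGHTS)


def metropolis_sample(n_samples, seed=0):
    """Draw topologies with a Metropolis sampler using a uniform proposal."""
    rng = random.Random(seed)
    current = rng.choice(TOPOLOGIES)
    samples = []
    for _ in range(n_samples):
        # Uniform proposal over all topologies, which is symmetric.
        proposal = rng.choice(TOPOLOGIES)
        accept_prob = min(1.0, WEIGHTS[proposal] / WEIGHTS[current])
        if rng.random() < accept_prob:
            current = proposal
        samples.append(current)
    return samples


def posterior_probability(samples, hypothesis):
    """Estimate P(hypothesis | data) as the fraction of samples satisfying it."""
    return sum(1 for s in samples if hypothesis(s)) / len(samples)


samples = metropolis_sample(50_000)
p_ab_sister = posterior_probability(samples, lambda t: t == "((A,B),C)")
print(f"Estimated P(A and B are sister taxa) = {p_ab_sister:.3f}")
```

With these assumed weights the exact value is 6 / (6 + 3 + 1) = 0.6, so the estimate should approach 0.6 as the number of samples grows. The same counting step applies to any hypothesis that can be checked on a single sampled tree or alignment.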
Specific aims of this project include: 1) To develop novel methods for using unaligned proteins to improve our understanding of the evolutionary relationship between protein sequence and tertiary structure. 2) To develop models for phylogenetic analysis that incorporate sequence and structure information and account for uncertainty in the alignment in the construction of phylogenetic trees and the estimation of evolutionary parameters. 3) To develop new computational algorithms for analyzing a large number of unaligned proteins. 4) To train interdisciplinary scientists capable of using sophisticated statistical methods to solve complex problems in evolutionary biology.

Public Health Relevance

This research will generate improved methods for investigating phylogenetic relationships over longer evolutionary timescales, improving our understanding of protein function. Since phylogenies capture the biological history of, and the correlations between, living organisms, these methods will help determine the origins and infection patterns of emerging diseases such as SARS and support the design of more effective drugs for rapidly evolving diseases such as influenza.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Special Emphasis Panel (ZGM1-CBCB-5 (BM))
Program Officer: Eckstrand, Irene A
University of California Santa Cruz
Engineering (All Types)
Schools of Engineering
Santa Cruz
United States
Rodriguez, Abel; Martinez, Julissa C (2014) Bayesian semiparametric estimation of covariate-dependent ROC curves. Biostatistics 15:353-69
Herman, Joseph L; Challis, Christopher J; Novák, Ádám et al. (2014) Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 31:2251-66
Daniels, Kyle G; Tonthat, Nam K; McClure, David R et al. (2014) Ligand concentration regulates the pathways of coupled protein folding and binding. J Am Chem Soc 136:822-5
Cartwright, Reed A; Lartillot, Nicolas; Thorne, Jeffrey L (2011) History can matter: non-Markovian behavior of ancestral lineages. Syst Biol 60:276-90
Yokoyama, Ken Daigoro; Thorne, Jeffrey L; Wray, Gregory A (2011) Coordinated genome-wide modifications within proximal promoter cis-regulatory elements during vertebrate evolution. Genome Biol Evol 3:66-74
Datta, Saheli; Prado, Raquel; Rodríguez, Abel et al. (2010) Characterizing molecular adaptation: a hierarchical approach to assess the selective influence of amino acid properties. Bioinformatics 26:2818-25