Phylogenetic analysis is a key tool in multiple areas, including disease monitoring and drug design; its goal is to infer evolutionary relationships among multiple species, as well as to provide insights into the mechanisms driving the process of molecular evolution. This proposal is informed by two recent trends in phylogenetic analysis. On the one hand, most current approaches for phylogenetic analysis require sequence alignments as input and produce reliable results only for proteins with at least a moderate degree of sequence similarity. On the other hand, the scientific community has started to realize that standard procedures for phylogenetic analysis, which first construct a sequence alignment and then use this single point estimate to guide the construction of the phylogenetic tree, can introduce serious biases and make researchers overconfident about the inferred evolutionary history. Indeed, alignment and tree construction are two interrelated problems that should be tackled jointly rather than sequentially. The proposed work represents the first attempt to include structural protein alignments in phylogenetic analysis while jointly accounting for uncertainty in both alignment and tree construction. Our approach employs Markov chain Monte Carlo algorithms to generate samples from the posterior distribution of alignments and trees given the sequences and structures, providing a straightforward procedure to compute probabilities of hypotheses of interest.
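As a rough illustration of how posterior samples can yield the probability of a hypothesis, the sketch below estimates the posterior probability that a group of taxa forms a clade as the fraction of sampled trees containing that clade. It is a minimal sketch, not the project's software: the function name `clade_posterior`, the representation of a tree as a set of clades, and the toy samples are all assumptions made for this example.

```python
def clade_posterior(sampled_trees, clade):
    """Monte Carlo estimate of the posterior probability of a clade.

    Each sampled tree is represented (hypothetically) as a set of
    frozensets, one frozenset of taxon names per clade in that tree.
    The estimate is the fraction of samples that contain `clade`.
    """
    if not sampled_trees:
        raise ValueError("need at least one sampled tree")
    target = frozenset(clade)
    hits = sum(1 for tree in sampled_trees if target in tree)
    return hits / len(sampled_trees)


# Toy stand-in for MCMC output: four sampled trees over taxa A, B, C, D.
# These values are invented for illustration only.
samples = [
    {frozenset({"A", "B"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"C", "D"})},
    {frozenset({"A", "C"}), frozenset({"A", "B", "C"})},
    {frozenset({"A", "B"}), frozenset({"A", "B", "D"})},
]

# Three of the four sampled trees group A with B.
p_ab = clade_posterior(samples, {"A", "B"})
print(p_ab)
```

In a joint analysis, each sample would also carry an alignment, so the same counting applies to hypotheses about the alignment or about alignment and tree together.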
Specific aims of this project include: 1) To develop novel methods for using unaligned proteins to improve our understanding of the evolutionary relationship between protein sequence and tertiary structure. 2) To develop models for phylogenetic analysis that incorporate sequence and structure information and account for uncertainty in the alignment in the construction of phylogenetic trees and the estimation of evolutionary parameters. 3) To develop new computational algorithms for analyzing a large number of unaligned proteins. 4) To train interdisciplinary scientists capable of using sophisticated statistical methods to solve complex problems in evolutionary biology.

Public Health Relevance

This research will generate improved methods for investigating phylogenetic relationships over longer evolutionary timescales, improving our understanding of protein function. Since phylogenies capture the biological history of, and the correlations among, living organisms, these methods will have an impact on determining the origins and infection patterns of emerging diseases such as SARS and on designing more effective drugs for rapidly evolving diseases such as influenza.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
1R01GM090201-01
Application #
7787329
Study Section
Special Emphasis Panel (ZGM1-CBCB-5 (BM))
Program Officer
Eckstrand, Irene A
Project Start
2009-09-30
Project End
2014-07-31
Budget Start
2009-09-30
Budget End
2010-07-31
Support Year
1
Fiscal Year
2009
Total Cost
$299,999
Indirect Cost
Name
University of California Santa Cruz
Department
Engineering (All Types)
Type
Schools of Engineering
DUNS #
125084723
City
Santa Cruz
State
CA
Country
United States
Zip Code
95064
Mukherjee, Chiranjit; Rodriguez, Abel (2016) GPU-powered Shotgun Stochastic Search for Dirichlet process mixtures of Gaussian Graphical Models. J Comput Graph Stat 25:762-788
Lee, Hui-Jie; Kishino, Hirohisa; Rodrigue, Nicolas et al. (2016) Grouping substitution types into different relaxed molecular clocks. Philos Trans R Soc Lond B Biol Sci 371:
Wang, Kuangyu; Yu, Shuhui; Ji, Xiang et al. (2015) Roles of solvent accessibility and gene expression in modeling protein sequence evolution. Evol Bioinform Online 11:85-96
Rodríguez, Abel; Quintana, Fernando A (2015) On species sampling sequences induced by residual allocation models. J Stat Plan Inference 157-158:108-120
Estrada, Rolando; Tomasi, Carlo; Schmidler, Scott C et al. (2015) Tree Topology Estimation. IEEE Trans Pattern Anal Mach Intell 37:1688-701
Wang, Hao; Rodríguez, Abel (2014) Identifying pediatric cancer clusters in Florida using loglinear models and generalized lasso penalties. Stat Public Policy (Phila) 1:86-96
Daniels, Kyle G; Tonthat, Nam K; McClure, David R et al. (2014) Ligand concentration regulates the pathways of coupled protein folding and binding. J Am Chem Soc 136:822-5
Herman, Joseph L; Challis, Christopher J; Novák, Ádám et al. (2014) Simultaneous Bayesian estimation of alignment and phylogeny under a joint model of protein sequence and structure. Mol Biol Evol 31:2251-66
Rodríguez, Abel; Martínez, Julissa C (2014) Bayesian semiparametric estimation of covariate-dependent ROC curves. Biostatistics 15:353-69
Rodriguez, Abel; Schmidler, Scott C (2014) Bayesian protein structure alignment. Ann Appl Stat 8:2068-2095

Showing the most recent 10 out of 18 publications