The 'protein problem' has remained unsolved despite decades of research [1, 2]. In principle, one expects that the primary amino acid sequence of a protein determines its structure, function, and evolutionary (SF&E) characteristics. Yet there is still no reliable method for predicting the native-state structure of a protein and its function given only its sequence. In addition, inferring the evolutionary relationships among highly divergent protein sequences is a daunting task. In general, when pairwise alignments between protein sequences fall below 25% identity, statistical measurements do not provide support robust enough to identify clear phylogenetic relationships, despite intensive research in this area [1, 3, 4]. The recent explosion in the availability of knowledge bases and computational techniques for the analysis of complex data has created an unprecedented opportunity for teasing out invaluable information from protein sequences. Starting from the basic premise that protein sequence encodes information about SF&E, we developed a unified framework for inferring SF&E from sequence information using a knowledge-based approach in which we measure the similarity between a query sequence and a set of biologically relevant profiles in an unbiased manner. Results from this Gestalt Domain Detection Algorithm-Basic Local Alignment Tool (GDDA-BLAST) provide phylogenetic profiles that have the capacity to model SF&E relationships of various proteins. Indeed, GDDA-BLAST is capable of deriving deep phylogenetic relationships for highly divergent proteins in a quantifiable manner [5, 6]. Preliminary results from our computational case study of the highly divergent family of retroelements accord with those previously reported, and demonstrate that GDDA-BLAST measurements can be treated as "fingerprints" from which distance estimates, and hence phylogenetic relationships, can be derived without prior information, multiple sequence alignment, or manual editing.
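As a rough illustration of how per-sequence "fingerprints" could be turned into distance estimates, the sketch below treats each fingerprint as a vector of similarity scores against a shared set of profiles and computes pairwise distances between those vectors. The scores, the sequence names, the function `pairwise_distances`, and the choice of Euclidean distance are all assumptions made for illustration; this is not the GDDA-BLAST scoring or distance procedure.

```python
import math


def pairwise_distances(fingerprints):
    """Return {(name_a, name_b): Euclidean distance} for every pair of sequences.

    `fingerprints` maps a sequence name to a list of scores, one score per
    profile. All vectors must be scored against the same profiles, in the
    same order.
    """
    names = sorted(fingerprints)
    distances = {}
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            vec_a = fingerprints[name_a]
            vec_b = fingerprints[name_b]
            if len(vec_a) != len(vec_b):
                raise ValueError("fingerprints must have equal length")
            squared = sum((x - y) ** 2 for x, y in zip(vec_a, vec_b))
            distances[(name_a, name_b)] = math.sqrt(squared)
    return distances


# Hypothetical toy scores: one entry per sequence, one column per profile.
toy_fingerprints = {
    "seq_A": [3.0, 0.0, 4.0],
    "seq_B": [0.0, 0.0, 0.0],
    "seq_C": [3.0, 0.0, 0.0],
}

toy_distances = pairwise_distances(toy_fingerprints)
for pair, distance in sorted(toy_distances.items()):
    print(pair, distance)
```

A distance table of this kind could then be passed to a standard distance-based tree-building method such as neighbor joining. No sequence-to-sequence alignment appears anywhere in the sketch, which is the property the abstract emphasizes.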
We propose that sequence information present within the "twilight zone" of sequence similarity can provide key insight into SF&E relationships among distantly related and/or rapidly evolving proteins. This proposal aims to push the limits of homology detection within the "twilight zone" by evaluating and optimizing GDDA-BLAST performance on benchmark and experimental data sets. Armed with these refined GDDA-BLAST measurements, we propose to conduct a comprehensive, ab initio phylogenetic study of retroelements and RNA-dependent RNA polymerases from the positive-strand family of RNA viruses (+ssRNA). Simultaneously, we will derive high-resolution maps of domain boundaries and empirically validate functional annotations and predictions of the key residues for those activities. This work aims to translate research from the computer to the laboratory bench. We expect that the tools and resources generated from this grant will be accessible and user-friendly to the bench scientist, thereby speeding the discovery process in other clinically relevant research endeavors.

Public Health Relevance

The long-term implication of this proposal is the development of a unified framework for high-resolution, simultaneous measurements of structure, function, and evolution. Should this prove possible: (i) functional and evolutionary measurements could quantitatively inform structural modeling to derive accurate atomic-resolution protein structures, (ii) structural and functional measurements could inform evolutionary histories to derive accurate evolutionary rates, deep-branch relationships, and homologous spaces within each protein, and (iii) structural and evolutionary measurements could reveal the location of the functionalities contained within any protein and of the regulatory elements that control them. Armed with this information, the speed at which diseases could be understood, and pharmacophores/therapies developed to combat them, would likely increase dramatically.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Genetic Variation and Evolution Study Section (GVE)
Program Officer
Lyster, Peter
University of California Davis
Schools of Medicine
United States
Lindy, Amanda S; Parekh, Puja K; Zhu, Richard et al. (2014) TRPV channel-mediated calcium transients in nociceptor neurons are dispensable for avoidance behaviour. Nat Commun 5:4734
Todd, George K; Boosalis, Casey A; Burzycki, Aaron A et al. (2013) Towards neuronal organoids: a method for long-term culturing of high-density hippocampal neurons. PLoS One 8:e58996
Chintapalli, Sree V; Bhardwaj, Gaurav; Babu, Jagadish et al. (2013) Reevaluation of the evolutionary events within recA/RAD51 phylogeny. BMC Genomics 14:240
Bhardwaj, Gaurav; Ko, Kyung Dae; Hong, Yoojin et al. (2012) PHYRN: a robust method for phylogenetic analysis of highly divergent sequences. PLoS One 7:e34261
Han, Qingxia; Aligo, Jason; Manna, David et al. (2011) Conserved GXXXG- and S/T-like motifs in the transmembrane domains of NS4B protein are required for hepatitis C virus replication. J Virol 85:6464-79
Hong, Yoojin; Chintapalli, Sree Vamsee; Ko, Kyung Dae et al. (2011) Predicting protein folds with fold-specific PSSM libraries. PLoS One 6:e20557
Kiselyov, Kirill; van Rossum, Damian B; Patterson, Randen L (2010) TRPC channels in pheromone sensing. Vitam Horm 83:197-213
Hong, Yoojin; Kang, Jaewoo; Lee, Dongwon et al. (2010) Adaptive GDDA-BLAST: fast and efficient algorithm for protein sequence embedding. PLoS One 5:e13596
Hong, Yoojin; Chalkia, Dimitra; Ko, Kyung Dae et al. (2009) Phylogenetic Profiles Reveal Structural and Functional Determinants of Lipid-binding. J Proteomics Bioinform 2:139-149