Internally symmetric proteins: Most internally symmetric proteins have a relatively small core unit, which is repeated. They are simple structures compared to proteins that are not symmetric, yet they appear capable of carrying out all types of functions: some are enzymes, others are carriers of proteins, still others are receptors, and so on. Because of their relative simplicity, they should be good molecules with which to study sequence-structure-function relations. The evolutionary history of these proteins is also interesting. They probably arose by gene duplication and fusion. Although mutation rates will vary depending on how strongly the function requires symmetry, proteins whose repeats are highly similar in sequence presumably arose late compared to those in which the sequence similarity is beginning to disappear. After sufficient time, the sequence similarity will disappear and the structural symmetry will also be degraded. Thus, symmetry should generally give an additional handle for following the evolution of these proteins.

Our symmetry detection program, SymD (Kim et al., BMC Bioinformatics 11:303, 2010), is based on two algorithms that we developed earlier, SE (Seed Extension; Tai et al., BMC Bioinformatics 10 Suppl 1:S4, 2009) and RSE (Refinement with SE; Kim et al., BMC Bioinformatics 10:210, 2009). SE finds the optimal structure-based sequence alignment for a given structure superposition without using the dynamic programming algorithm or a gap penalty. RSE uses SE and the Kabsch algorithm to find the optimal structure superposition and structure-based sequence alignment, starting from an initial structure superposition or sequence alignment. SymD itself works by using RSE to optimally align a protein structure to itself after circularly permuting the second copy by k residues, for all values of k from 1 to N-3, where N is the total number of residues in the protein.
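The circular-permutation scan described above can be illustrated with a minimal sketch. This is not the SymD implementation: SymD scores each shift with an RSE structure alignment on three-dimensional coordinates, whereas the sketch below substitutes a toy score that merely counts identical labels. The function names (circular_shift, identity_score, scan_shifts, best_shifts) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def circular_shift(items: Sequence, k: int) -> list:
    """Return a copy of items circularly permuted by k positions."""
    k %= len(items)
    return list(items[k:]) + list(items[:k])


def identity_score(a: Sequence, b: Sequence) -> int:
    """Toy stand-in for RSE (an assumption): count positions where both copies agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def scan_shifts(
    items: Sequence,
    score: Callable[[Sequence, Sequence], float] = identity_score,
) -> List[Tuple[int, float]]:
    """Score the chain against its own shifted copy for every k from 1 to N-3."""
    n = len(items)
    return [(k, score(items, circular_shift(items, k))) for k in range(1, n - 2)]


def best_shifts(items: Sequence) -> List[int]:
    """Return every shift k that reaches the highest score in the scan."""
    results = scan_shifts(items)
    top = max(s for _, s in results)
    return [k for k, s in results if s == top]


# Toy chain with a 4-unit repeat; the letters are labels, not real residues.
print(best_shifts("ABCDABCDABCD"))  # [4, 8]
```

In this toy case the best shifts are multiples of the repeat length, which is the kind of signal a self-alignment scan is meant to expose. With real structures the score would come from the quality of the structural alignment rather than from label identity.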
Using this program, we determined that approximately 20% of all distinct protein domains (SCOP 1.75 ASTRAL 40% domain dataset) may be considered globally symmetric. These include most of the well-known symmetric folds, including TIM barrels, alpha-alpha superhelices and toroids, beta-trefoils, beta-propellers, leucine-rich repeats, ferredoxins, etc. The symmetries observed are broadly of three types: slip, closed, and open. Slip symmetric proteins look invariant after a translation by a few residues in one direction. As far as we know, we are the first to recognize this invariance and to consider it a type of symmetry (manuscript in preparation). These are mostly helix bundles. In symmetric closed structures, the N- and C-termini of the molecule come close together and the two ends of the molecule are 'stitched' together, often by a set of hydrogen bonds (the Velcro joining). Most of these have 2- to 8-fold rotational symmetries, but the transmembrane beta-barrels can have higher symmetries and also screw symmetries. In symmetric open structures, the N- and C-termini are at opposite ends of the molecule. All have a helical or a pure 2-fold rotational symmetry. A protein with a pure 2-fold rotational symmetry can have either a closed (intertwined) or an open structure.

Current research effort is directed at (1) characterizing the small number of protein domains that have two or more symmetry elements, (2) perfecting the algorithm for automatic classification of observed symmetries, and (3) developing an algorithm for detecting locally symmetric sub-structures that are embedded in larger, globally non-symmetric structures. Future efforts will be directed at collecting repeating units and studying their structure and interaction.

Protein structure modeling: Protein structure modeling involves predicting the three-dimensional structure of protein molecules from their sequence.
Protein structure prediction is an important problem in molecular biology, and many laboratories around the world are dedicated to solving it. The Critical Assessment of protein Structure Prediction (CASP) is a well-known, public series of biennial experiments designed to objectively measure the progress made by the protein structure prediction community. In these experiments, predictors submit models of proteins before the structures are known, and independent assessors evaluate the collected models against the experimental structures without knowing the identity of the predictors. We participated in the 2012 CASP10 experiment as one of the three assessor teams. For this evaluation, we devised and tested three new structure evaluation score functions and designed a new plugin for the Chimera protein structure visualization software that enabled us to rapidly inspect a large number of models visually and compare them to the target structure. Using these tools, we evaluated over 10,000 "template-free" models by visual inspection over a two-month period. We also evaluated the "contact-assisted" models, which showed considerable promise as the future of structure modeling. From this exercise, we could gauge first hand the state of the art of protein structure prediction, and we gained intimate knowledge of the available protein structure prediction web servers, which we can use for future structure modeling tasks. Three manuscripts describing this work have been submitted and are in the review process.

The Signal Transducer and Activator of Transcription (STAT) proteins are DNA-binding transcription factors that modulate gene expression in response to cytokines, interferons, and various growth factors through JAK kinases. Activated STAT binds DNA as a dimer most of the time, but it often also works as a tetramer, with variable spacing between the two dimer-binding motifs, called GAS motifs. Dr. Warren Leonard at NHLBI and his colleagues found that tetramerization of STAT5 was critical for cytokine responses (Lin et al., Immunity 36:586-599, 2012). They also identified over 500 sites on the mouse genome that bind STAT5 as a tetramer. These sites are made of two GAS motifs that are separated by different numbers of base pairs. The histogram of these sites by GAS motif spacing showed 5 peaks, with about 5 bp separation between the peaks. We modeled the tetramer binding to DNA and calculated the probability of forming the tetramer-DNA complex at different spacings between the two dimer binding sites. The probability distribution nearly reproduced the observed histogram. This means that the frequency of genes with STAT tetramer binding ability correlates with the probability of forming the tetramer complex. The full biological meaning of this observation is not yet clear. A manuscript on this work is currently in preparation.

Recombinant immunotoxins (RIT) are a group of new anti-cancer agents targeted to specific cell-surface receptors. These are man-made protein molecules, composed of mouse antibody parts fused to a part of a potent bacterial toxin. A major problem with these agents is that the patient develops an immune response that neutralizes the RIT. In collaboration with Dr. Ira Pastan, we identified potential B-cell epitopes on the surface of these molecules. Dr. Pastan and his coworkers mutated some of these epitope residues to produce a molecule that elicits a substantially reduced immune response (Liu et al., PNAS 109:11782-7, 2012). In the above work, we used an intuitive procedure, based mainly on the degree of solvent exposure, to identify potential B-cell epitopes. We are currently developing a formal algorithm to automate this procedure.
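The exposure-based selection mentioned above could, in its simplest form, be sketched as a filter on per-residue solvent exposure. This is only an illustration under stated assumptions, not the procedure used in the cited work: the Residue record, the relative-exposure values, and the 0.5 cutoff are all hypothetical placeholders.

```python
from typing import List, NamedTuple


class Residue(NamedTuple):
    name: str            # three-letter residue name
    number: int          # position in the chain
    rel_exposure: float  # relative solvent exposure, 0 (buried) to 1 (fully exposed)


def exposed_candidates(residues: List[Residue], cutoff: float = 0.5) -> List[Residue]:
    """Return residues at or above the cutoff, most exposed first.

    The cutoff is an arbitrary illustrative value, not one from the source.
    """
    kept = [r for r in residues if r.rel_exposure >= cutoff]
    return sorted(kept, key=lambda r: r.rel_exposure, reverse=True)


# Made-up input purely to show the call; these are not measured values.
toy_surface = [
    Residue("ARG", 10, 0.82),
    Residue("LEU", 11, 0.05),
    Residue("GLU", 12, 0.61),
    Residue("ALA", 13, 0.50),
]
for residue in exposed_candidates(toy_surface):
    print(residue.name, residue.number, residue.rel_exposure)
```

A formal algorithm would presumably weigh more than exposure alone; the sketch only shows where such a ranking step would sit.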

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Investigator-Initiated Intramural Research Projects (ZIA)
National Cancer Institute Division of Basic Sciences
Tai, Chin-Hsien; Bai, Hongjun; Taylor, Todd J et al. (2014) Assessment of template-free modeling in CASP10 and ROLL. Proteins 82 Suppl 2:57-83
Taylor, Todd J; Tai, Chin-Hsien; Huang, Yuanpeng J et al. (2014) Definition and classification of evaluation units for CASP10. Proteins 82 Suppl 2:14-25
Taylor, Todd J; Bai, Hongjun; Tai, Chin-Hsien et al. (2014) Assessment of CASP10 contact-assisted predictions. Proteins 82 Suppl 2:84-97
Tai, Chin-Hsien; Vincent, James J; Kim, Changhoon et al. (2009) SE: an algorithm for deriving sequence alignment from a pair of superimposed structures. BMC Bioinformatics 10 Suppl 1:S4
Kim, Changhoon; Tai, Chin-Hsien; Lee, Byungkook (2009) Iterative refinement of structure-based sequence alignments by Seed Extension. BMC Bioinformatics 10:210
Goonesekere, Nalin C W; Lee, Byungkook (2008) Context-specific amino acid substitution matrices and their use in the detection of protein homologs. Proteins 71:910-9
Sam, Vichetra; Tai, Chin-Hsien; Garnier, Jean et al. (2008) Towards an automatic classification of protein structural domains based on structural similarity. BMC Bioinformatics 9:74