Protein structures cluster into families of folds that act as evolutionary templates, where the backbone structures are recycled to create proteins with different functions. Theoretical models of protein evolution propose the existence of a "neutral network" in sequence space that interconnects the different sequences encoding the same fold. In such networks, continuous paths of point mutations or small insertion/deletion changes connect all component sequences. These networks can be quite large, since many different sequences can encode the same fold. A fundamental question in protein evolution is how and how often exchanges between different neutral networks occur, leading to evolution from one fold to another. During evolution, the induction of new phenotypic traits by a small number of mutations has to be balanced against the deleterious effects on vital functions that mutations can cause. There is evidence that molecular evolution may be steered by the ability of biomolecules to take on numerous conformations as a bridge between different folds. Under this scenario, an evolving protein can initially attain increased fitness for a new function without losing its original function. Bridge states allow proteins to explore new structures and functions while part of the structural ensemble retains the initial conformation and function as insurance. Of particular interest are rare bridge or transition sequences that fold with different probabilities into distinct non-overlapping structures. Protein stability is generally viewed in terms of a two-state transition between a unique native state and an ensemble of unfolded ones. However, design and mutagenesis experiments suggest that the difference in free energy of alternative folds may be much smaller than typically envisioned, leading to evolutionary rates that are sensitive to the free energy differences between alternative conformations. 
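To make concrete how strongly the occupancy of an alternative fold depends on a small free energy difference, the sketch below computes equilibrium populations of two folds from a Boltzmann weighting. It is illustrative only: it assumes just two competing folded states (the unfolded ensemble is ignored), and the function name `fold_populations` and the example free energy values are hypothetical, not taken from the project.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def fold_populations(delta_g, temperature=298.15):
    """Return (p_A, p_B) for two folds at equilibrium.

    delta_g is G_B - G_A in kcal/mol, so positive values favor fold A.
    Assumption: only the two folded states are populated.
    """
    p_b = 1.0 / (1.0 + math.exp(delta_g / (R_KCAL * temperature)))
    return 1.0 - p_b, p_b


if __name__ == "__main__":
    # Hypothetical free energy gaps, chosen only to show the trend.
    for dg in (0.0, 1.0, 3.0):
        p_a, p_b = fold_populations(dg)
        print(f"dG = {dg:.1f} kcal/mol -> p_A = {p_a:.3f}, p_B = {p_b:.3f}")
```

Under these assumptions, a gap of about 1 kcal/mol still leaves a substantial minority of molecules in the alternative fold at room temperature, which is the regime in which a bridge sequence could retain its original function while sampling a new one.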
This research centers on how protein folds change over time, on how the existence of alternative folds affects the rates of protein evolution, and on how new protein functionality can evolve from an already existing protein. Emphasis is given to the characterization of the conformational space, evolutionary pathways, connections, and properties of select protein sequences that, in principle, can stably adopt two different folds, or exist in one fold but are 1 to 3 mutations away from a different fold. Specific protein systems that show a very high sequence identity but display different folds and functions will be considered, such as alternate folds based on the patterning of polar versus non-polar amino acids in the P22 Arc repressor homodimer, and engineered proteins based on the GA and GB domains of the cell wall Protein G of Streptococcus bacteria. The study of these protein systems by atomistic molecular dynamics techniques will be complemented by molecular evolution analyses of diverse protein families, especially those for which the evolutionary impact of tertiary structure has previously been investigated without regard to the potential role of alternative protein folds on evolutionary rates. A novel evolutionary inference procedure that can quantitatively assess the evolutionary influence of alternative folds will facilitate these investigations. The most interesting of the putative ancestral proteins identified by the inference procedure will then be studied in more detail with atomistic modeling techniques that will examine all relevant structural characteristics.

This project will foster interaction between the molecular simulation and evolution research communities, which have traditionally been largely isolated from each other. This is primarily a student- and postdoc-based research project, which will foster educational ties between NC State Bioinformatics and Physics. The larger computational biomolecular community will benefit through the continued development of freely available software for the AMBER package, as well as through the development of new evolutionary inference software. In addition, the PI will develop new graduate courses, foster the retention and recruitment of minority students, develop the Biophysics option at NC State, provide international research experience to students (mainly via a collaboration in Japan), and provide a well-rounded and rich environment for students and research partners at all educational levels.

Project Report

This project focused on the physicochemical characterization of different protein folds arising from sequences that are only a small number of mutations apart, with the aim of gaining insight into the molecular evolution of these proteins. The research made use of large-scale biomolecular simulations based on classical molecular dynamics (MD). The study of selected protein systems by atomistic MD techniques was complemented by molecular evolution analysis of diverse protein families, especially those for which the evolutionary impact of tertiary structure has previously been investigated without regard to the potential role of alternative protein folds on evolutionary rates. The investigations were facilitated by a novel evolutionary inference procedure for quantitative assessment of the evolutionary influence of alternate folds. Some of the major accomplishments associated with this grant are the following. We have successfully implemented our new structurally and functionally constrained protein evolution models as plugins for the newly released Bayesian Evolutionary Analysis by Sampling Trees (BEAST2) software package. This allows BEAST2 users, for the first time, to perform statistical inference with probabilistic codon models that take into account both protein structure and protein-coding gene expression. The extensions of BEAST2 were taken in two different directions. The first was to capture some element of the protein structure (e.g., relative solvent accessibility of different positions within the structure) and the gene expression level of a protein, all while retaining the conventional assumption that protein sequence positions (or codons) evolve independently. This independence assumption makes calculation of the likelihoods computationally efficient via the pruning algorithm of Felsenstein (1981). The second direction is more biologically realistic and statistically desirable, but comes at a computational cost.
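The role of the independence assumption can be illustrated with a minimal sketch of Felsenstein's pruning algorithm. This is not the project's BEAST2 code: for brevity it assumes a four-state Jukes-Cantor nucleotide model in place of the structure-aware codon models described above, and the tree encoding, leaf names, and function names are all hypothetical.

```python
import math

STATES = "ACGT"


def jc69_prob(i, j, t):
    """Jukes-Cantor probability of state i changing to state j over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partials(node, column):
    """Conditional likelihoods P(tip data below node | node is in state s), for each s.

    A leaf is a name (str) looked up in `column`; an internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return [1.0 if s == column[node] else 0.0 for s in STATES]
    left, t_left, right, t_right = node
    l_part = partials(left, column)
    r_part = partials(right, column)
    out = []
    for s in STATES:
        from_left = sum(jc69_prob(s, x, t_left) * l_part[k] for k, x in enumerate(STATES))
        from_right = sum(jc69_prob(s, x, t_right) * r_part[k] for k, x in enumerate(STATES))
        out.append(from_left * from_right)
    return out


def site_likelihood(tree, column):
    """Likelihood of one alignment column, with uniform root state frequencies."""
    return sum(0.25 * p for p in partials(tree, column))


def log_likelihood(tree, alignment):
    """Independent sites: the total log-likelihood is a sum over alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {name: seq[i] for name, seq in alignment.items()}))
        for i in range(n_sites)
    )


if __name__ == "__main__":
    # Hypothetical three-taxon tree and toy alignment.
    tree = (("a", 0.1, "b", 0.1), 0.2, "c", 0.3)
    alignment = {"a": "ACGT", "b": "ACGA", "c": "ATGA"}
    print(log_likelihood(tree, alignment))
```

Because each column is handled separately, the cost grows linearly with the number of sites. Once positions interact, the likelihood no longer factorizes over columns in this way, which is the source of the computational cost noted for the second direction.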
With our second extension, protein positions evolve in a dependent fashion due to our incorporation of pairwise interactions between amino acids. The source code for our BEAST2 plugins may be found at https://github.com/learking/finalEVOIND and the binary executables at http://www4.ncsu.edu/~kwang2/EVOID.addon.jar; these may be manually installed by adding the *.jar files to BEAST2’s default package installation directory (see www.beast2.org/wiki/index/php/Managing_Packages). In terms of MD simulations, some of the most interesting results emerged from our investigations of the functional specificity of the Pdx1 homeodomain. In brief, the pancreatic and duodenal homeobox 1 (Pdx1) is a transcription factor that plays an important role in pancreatic endocrine/exocrine cell development and in the maintenance of adult islet beta-cell functions. Mutations in Pdx1 cause a form of familial diabetes, maturity-onset diabetes of the young type 4. Although there have been many structural studies of the DNA binding properties of homeodomains, the factors behind the binding specificity are still difficult to elucidate. A crystal structure of the Pdx1 homeodomain bound to DNA (PDB 2H1K) shows two complexes that differ in the conformation of the N-terminal arm, in major groove contacts, and in backbone contacts, raising new questions about the DNA recognition process in homeodomains. We carried out fully atomistic MD simulations in both crystal and solvated environments in order to elucidate the nature of the binding contacts. Our results indicate that the system is characterized by stable binding conformers, as previously identified experimentally. These results were surprising, because it had been assumed that proteins recognize DNA by finding their lowest-energy state. Our study indicates that transcription factors may bind DNA in an ensemble of conformers. This is also important from an evolutionary standpoint.
Multiple functions of a transcription factor could, in principle, be achieved via point mutations in the N-terminal arm, or through a "binding polymorphism" involving no mutations. Thus, in principle, one or two point mutations are enough to alter the binding properties and therefore the specificity of the transcription factor. Other important investigations included studies of the alternative folds based on the patterning of polar versus non-polar amino acids in the Arc repressor homodimer of bacteriophage P22; an analysis of the structure, free energies, and so-called "PPII" propensity of amino acid guests in proline-rich peptides; an investigation of how C-terminal proline segments alter the structural conformations of polyglutamine peptides; an atomistic study of the B-to-Z transition; and an investigation of the structure of the gp41(659-671) HIV-1 antibody epitope, along with an assessment of the secondary structure assignments made by the codes STRIDE, DSSP, and KAKSI in current use. In terms of broader impacts, this research grant provided support for two graduate students and partial support for two postdoctoral fellows, who received training in atomistic simulation and statistical methods. In addition, the PIs taught graduate-level courses that integrated their current research, and also helped organize a topical conference and a special symposium at a major scientific meeting.

Agency
National Science Foundation (NSF)
Institute
Division of Molecular and Cellular Biosciences (MCB)
Application #
1021883
Program Officer
Kamal Shukla
Project Start
Project End
Budget Start
2010-09-01
Budget End
2014-08-31
Support Year
Fiscal Year
2010
Total Cost
$600,000
Indirect Cost
Name
North Carolina State University Raleigh
Department
Type
DUNS #
City
Raleigh
State
NC
Country
United States
Zip Code
27695