New tools based on graph theory have revolutionized genome analysis, providing better ways to identify and classify rearrangements of genomic fragments. The same tools have recently also provided a major breakthrough in multiple sequence alignment. Here we propose to apply these tools to protein structure analysis and to use the resulting insights into protein structure evolution to increase model quality in comparative modeling. Structure comparisons between distant homologs show clearly that the dominant paradigm in structure comparison, that a protein structure can be divided into an invariant core and flexible loops, breaks down below a sequence identity threshold of 40%-50%. Instead, significant rearrangements can happen anywhere in the structure, with secondary structure elements undergoing significant shifts and movements. As a result, the standard protocol in comparative modeling, based on mounting a sequence on a rigid core structure, must fail for such homologs. Structural differences between homologs are driven, as is the entire folding process, by the free energy of the system, but because of serious deficiencies in current force fields and computational approaches, energy-based predictions of such changes are not successful. In this grant we propose to improve the quality of comparative modeling by first discovering and then applying empirical rules of protein structure change. The rapid growth in the number of known protein structures, fueled in part by technical advances in high-throughput structure determination spearheaded by the Protein Structure Initiative, has resulted in increasingly dense coverage of the structural space of many folds. This provides a rich learning base for discovering such empirical rules, provided that the right formalism to describe protein structure changes can be developed.
In preliminary analyses we have shown that, in the next approximation beyond the invariant core/flexible loops model, a protein structure can be described as built from rigid subdomains, and that simple rearrangements of these subdomains account for almost half of the structural differences between distant homologs. Moreover, proteins can only adopt structures lying in a specific low-dimensional subspace of the entire conformational space. To improve the quality of models from comparative modeling, we plan to identify conserved subdomains for all known folds and to describe the allowed subspaces by analyzing already known structures from these folds. In the next step we will use this information to generate possible variants of the template structure and use model evaluation tools to identify the one most similar to the [sic].
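As an illustration of what a rigid subdomain means in practice, the sketch below groups residues of two homologous structures so that, within each group, all pairwise distances are conserved. It is a minimal sketch under stated assumptions, not the method of this proposal: the function names, the greedy grouping strategy, and the 1.0 Å tolerance are hypothetical, and the two coordinate lists are assumed to hold one point per residue (for example C-alpha atoms) in already aligned order.

```python
import math


def distance_matrix(coords):
    """Return the matrix of pairwise Euclidean distances between points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def rigid_subdomains(coords_a, coords_b, tolerance=1.0):
    """Greedily group residues whose mutual distances are conserved.

    coords_a, coords_b: equal-length lists of (x, y, z) tuples, one per
    aligned residue, for two homologous structures (assumed input format).
    tolerance: largest allowed change in a pairwise distance, in the same
    units as the coordinates (hypothetical default).

    Each residue joins the first group in which its distance to every
    member differs by at most `tolerance` between the two structures;
    otherwise it starts a new group. Every returned group therefore
    behaves as a rigid body, although the greedy order does not
    guarantee the largest possible groups.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")

    dist_a = distance_matrix(coords_a)
    dist_b = distance_matrix(coords_b)

    groups = []
    for i in range(len(coords_a)):
        for group in groups:
            if all(abs(dist_a[i][j] - dist_b[i][j]) <= tolerance for j in group):
                group.append(i)
                break
        else:
            # No existing group accepts residue i, so it seeds a new one.
            groups.append([i])
    return groups


if __name__ == "__main__":
    # Toy example: six residues on a line; in the second structure the
    # last three are shifted together, as one rigid block.
    structure_a = [(3.8 * i, 0.0, 0.0) for i in range(6)]
    structure_b = structure_a[:3] + [(x, y + 10.0, z) for x, y, z in structure_a[3:]]
    print(rigid_subdomains(structure_a, structure_b))
```

Comparing internal distances rather than raw coordinates avoids the need to superimpose the two structures first, since distances within a rigid block are unaffected by how that block has moved relative to the rest of the protein.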