Combining information from the vast body of protein sequences within the framework of protein structures enables a deeper understanding of the complex effects of amino acid substitutions. Compiling the sequence correlations within protein structural domains will make it easier to distinguish neutral from deleterious changes. Protein structures provide the framework for understanding the sequence data, both through the physical proximity of directly interacting amino acids and through the manifestation of allostery. This will transform sequence matching from a 1-D process into a 3-D process. Owing to rapid advances in sequencing, the large number of available genomes now provides hundreds of millions of protein sequences, and similar advances in structural biology now provide more than 100,000 protein structures. Combining these data, our preliminary results show that accounting for pairwise sequence correlations between residues that interact closely in the protein structures immediately improves the ability to identify similar structures by sequence matching. Other preliminary data show that function identification by sequence matching is also improved. Such improved homolog identification can lead to progress in structure prediction. The overarching goal is to apply a deep knowledge of protein structure, together with analyses of the available sequence data, to the important problem of protein sequence matching. We take an entirely new, highly innovative, and uniquely multi-faceted approach to this problem. It is well established that physical factors, such as the dense packing of amino acids and other physical aspects of structures, affect the conservation of amino acids, and these factors are accounted for in the new approaches to sequence matching taken here.
The rationale is that protein structures provide the physical information and the framework needed to incorporate aspects of 3-D structure and allostery into sequence matching. Accounting for protein flexibility and conformational dynamics will further broaden the conformational space investigated and provide a better understanding of the correlations important for sequence evolution. Results from this project will improve the practice of molecular biology, particularly the identification of functions for proteins that have no assigned function, and this is certain to have major impacts on the understanding of evolution. This project will apply innovative new methods for extracting correlations in sequence, structure, and dynamics by data mining of sequences and structures. The novel structure-based approaches will enable major advances in sequence matching, which will be implemented and disseminated on new web servers made available to anyone. The outcomes of the project will enable any scientist to discriminate significantly more effectively between similar and dissimilar sequences. This better discrimination is essential for better function prediction, for a better understanding of evolution, for better identification of non-functional protein mutants, for improved protein design, for medical diagnosis, and for medical practice in the era of individual patient genomes.
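As a rough illustration of what extracting pairwise sequence correlations for structurally interacting residue pairs might look like, the sketch below scores column pairs of a multiple sequence alignment by mutual information, restricted to pairs listed as structural contacts. Mutual information is used here only as a common stand-in measure; the abstract does not state which correlation measure the project uses. The function names, the toy alignment, and the contact list are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `msa` is a list of equal-length aligned sequences (hypothetical input).
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        ratio = p_ab * n * n / (count_i[a] * count_j[b])
        mi += p_ab * math.log2(ratio)
    return mi


def contact_pair_correlations(msa, contacts):
    """Score only the column pairs that are in contact in the structure.

    `contacts` is a list of (i, j) column index pairs, assumed to come from
    a structure-derived contact map (not computed here).
    """
    return {(i, j): mutual_information(msa, i, j) for i, j in contacts}


# Toy data, invented for illustration: columns 0 and 1 co-vary perfectly,
# while column 2 is fully conserved and carries no covariation signal.
toy_msa = ["AKLE", "AKLE", "GRLD", "GRLD"]
toy_contacts = [(0, 1), (0, 2)]
scores = contact_pair_correlations(toy_msa, toy_contacts)
```

In this toy case the co-varying contact pair receives a higher score than the pair involving the conserved column. A practical implementation would also need sequence weighting, gap handling, and corrections for finite sampling and phylogenetic bias, none of which are attempted here.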

Public Health Relevance

In the era of patient genomes, individualized medicine will rely on gene sequencing and knowledge, and rapidly understanding the differences among mutant behaviors becomes a critical element for diagnosis and for developing individual therapies. Combining Big Data from protein sequences and structures will make it computationally possible to understand the effects of mutations by means of structure-principled sequence matching. Our project will develop robust new tools for use in precision medicine and thus will directly and broadly impact public health.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Macromolecular Structure and Function D Study Section (MSFD)
Program Officer
Lyster, Peter
Iowa State University
Organized Research Units
United States