The most basic tool for studying the protein structure universe is a program that compares protein structures and, in doing so, yields a structure-based sequence alignment. We evaluated existing programs and found that the sequence alignments they produce are in error by 20% on average when sequence homology is low, and by much larger amounts in some individual cases (Kim and Lee, BMC Bioinformatics, 2007). We therefore devised a new structure alignment procedure, which we call RSE, based on a recently developed "Seed Extension" algorithm (Tai et al., BMC Bioinformatics, accepted). This procedure will make structure-based sequence alignment both more accurate and faster to calculate.

Many protein structures are made of smaller units called domains, and it is often necessary to deal with one domain at a time. Parsing a protein structure into domains is therefore another basic operation in protein structure studies. In the past, domain parsing has relied on intuitive criteria, and different programs produce different domain sets for the same protein. Even the well-known manually curated protein domain structure databases SCOP and CATH use different domain definitions. In collaboration with Peter Munson at NIH and Jean Garnier and Jean-Francois Gibrat at INRA (France), we are working on defining domains on the basis of the recurrence of the same or similar structure in other proteins. We hope that this procedure will (a) put domain definition on a sounder footing, (b) explain, in many cases, why domain definitions differ, and (c) relate possible alternate domain definitions to the number and source of the proteins that have the conserved domain structure, and therefore to the evolutionary history of protein domain structures.

Protein structures are complex and difficult to comprehend or describe. To help understand these structures and facilitate comparisons, we propose to define a unique coordinate system for each structure that is determined by the structure itself, either by its inherent symmetry or by the orientation and arrangement of its secondary structural elements. We will work to identify symmetric proteins and proteins for which secondary structures define a unique axis. Only a couple of procedures for identifying symmetric proteins have been published. We plan to develop a new one that applies the RSE procedure multiple times to each protein. We have already developed a procedure for defining a unique axis from the secondary structure elements and found that such an axis exists for most of the proteins in the Protein Data Bank (PDB).

One sign that protein structures are understood is the ability to classify them systematically. In addition, the number of protein structures is increasing exponentially, partly because of structural genomics activities, and these structures must be compared and classified in order to be understood. In collaboration with Dr. Munson of CIT and Drs. Gibrat and Garnier of INRA (France), we devised many machine classification schemes, but when their results were compared with the manually curated classification database SCOP, we found major discrepancies (Sam et al., BMC Bioinformatics, 2006, 2008). Some of these differences are clearly due to inadequacies in the structure comparison programs used; the RSE procedure should nearly eliminate this source of discrepancy. Others are due to the subjective nature of the existing classification database. The existing manually curated classification databases have two other major problems: (1) manual curation is slow and subject to human error, and (2) the classification lacks an organizing principle. For example, the SCOP database has no clear definition of the concept of a "Fold", and there is no natural order that one can assign to the different "Folds" in the database. We hope to produce a new protein structure organization scheme that is based on the natural coordinate system defined in this study and that is generated completely automatically. Automatic domain parsing is a prerequisite for this process.
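
The comparison between an automatic classification and a curated one such as SCOP can be quantified in several ways. As a minimal sketch, and not necessarily the measure used in the studies cited above, the Rand index counts the fraction of domain pairs that the two schemes treat the same way (both group the pair together, or both keep it apart). The function name `pair_agreement`, the domain identifiers, and the class labels below are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pair_agreement(auto_labels, reference_labels):
    """Rand index: fraction of domain pairs on which two classifications agree.

    Both arguments map a domain identifier to a class label. A pair counts as
    an agreement when both schemes put the two domains in the same class, or
    both put them in different classes.
    """
    if auto_labels.keys() != reference_labels.keys():
        raise ValueError("both classifications must cover the same domains")

    agree = 0
    total = 0
    for a, b in combinations(sorted(auto_labels), 2):
        same_auto = auto_labels[a] == auto_labels[b]
        same_ref = reference_labels[a] == reference_labels[b]
        agree += same_auto == same_ref
        total += 1

    # With fewer than two domains there are no pairs to disagree on.
    return agree / total if total else 1.0


# Hypothetical toy data: four domains, two automatic clusters, two curated folds.
auto = {"d1": "c1", "d2": "c1", "d3": "c2", "d4": "c2"}
ref = {"d1": "foldA", "d2": "foldA", "d3": "foldA", "d4": "foldB"}
score = pair_agreement(auto, ref)
```

A score of 1.0 means the two schemes partition the domains identically, whatever the labels are called. Lower values point to the kind of discrepancy described above, although the score alone does not say whether the alignment program or the curated reference is responsible.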

Agency
National Institutes of Health (NIH)
Institute
National Cancer Institute (NCI)
Type
Intramural Research (Z01)
Project #
1Z01BC011000-01
Application #
7733404
Support Year
1
Fiscal Year
2008
Total Cost
$664,957
Name
National Cancer Institute Division of Basic Sciences
Country
United States
Goonesekere, Nalin C W; Lee, Byungkook (2008) Context-specific amino acid substitution matrices and their use in the detection of protein homologs. Proteins 71:910-9
Sam, Vichetra; Tai, Chin-Hsien; Garnier, Jean et al. (2008) Towards an automatic classification of protein structural domains based on structural similarity. BMC Bioinformatics 9:74