Protein three-dimensional structures are drawn from the Protein Data Bank, an international collaboration supported in part by the NIH. Records are processed at NCBI to provide precise definitions of sequences, structures, and molecular interactions. Protein structure records are compared to all NCBI protein sequence records using the BLAST algorithm, and to one another using the VAST structure-comparison algorithm. These automated comparisons provide the cross-references needed to link protein and gene sequences in the NCBI collection to the biological function annotation provided by protein structure records.

Informatics projects were needed this year to address the "remediation" project undertaken by the Protein Data Bank. The remediation modified 100% of the more than 50,000 structure records in the collection and provided revised sequences and structures for over 30% of them. It necessitated an update of the entire NCBI database and calculation of new neighboring/similarity relationships for a large number of structure records. Other informatics projects automated weekly updates and provided improved molecular graphics summaries for structure records in the Entrez "docsum" and "Structure Summary" displays. A new project, still under way, clusters molecular interactions observed in related structures to provide a concise summary of biological functions.

The NCBI "Conserved Domains" Entrez database is drawn in part automatically from protein family alignments prepared by others. These include, for example, the "Pfam" collection prepared at the Wellcome Trust Sanger Institute and the "Protein Clusters" prepared at NCBI/IEB. Another component of the "Conserved Domains" database is the set of expert-curated protein family alignments prepared by the staff of the project. Alignments consistent with known three-dimensional structures are prepared using algorithms within the "Cn3D" program, and conserved subfamilies consistent with phylogenetic evidence are derived using the "CDTree" program. Curators record biological functions as indicated by interactions observed in three-dimensional structures within the family, or by other sources such as experimental studies in the literature. Protein sequences in the NCBI collection are automatically compared to conserved domain records using the "Reverse PSI-BLAST" algorithm.

An informatics project undertaken this year developed an algorithm for linking protein sequences to any subfamily that specifically includes that protein. Cross-validation experiments showed that "Reverse PSI-BLAST" scores within the range shown by the representative sequences in the curated subfamily alignment are near-perfect indicators of subfamily membership. Links from protein sequences to "Conserved Domains" have been modified to include ranking by this algorithm, so that the correct, subfamily-specific biological function is presented first, at the top of the list. Another project includes interaction sites drawn from three-dimensional structures in the default "Conserved Domains" link from sequence records. A project still in progress aims to provide protein-structure-based alignments for a large fraction of conserved superfamilies, to improve biological function annotation more rapidly than is possible by phylogenetic characterization of all subfamilies.
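
The subfamily-linking rule described above can be illustrated with a minimal Python sketch. It is not the project's implementation. It assumes that "within the range shown by the representative sequences" means scoring at or above the lowest-scoring representative, and that subfamily-specific hits are simply sorted ahead of all others. The `DomainHit` type, the model names, and the scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One hypothetical Reverse PSI-BLAST hit of a query protein to a domain model."""
    model: str               # hypothetical model identifier
    score: float             # score of the query against this model
    rep_scores: tuple = ()   # scores of the model's curated representatives;
                             # empty when the model is not a curated subfamily


def is_specific(hit):
    """Assumed rule: the query scores at least as well as the weakest representative."""
    return bool(hit.rep_scores) and hit.score >= min(hit.rep_scores)


def rank_hits(hits):
    """Order hits with subfamily-specific ones first, then by descending score."""
    return sorted(hits, key=lambda h: (not is_specific(h), -h.score))


# Illustrative data only; these are not real models or scores.
hits = [
    DomainHit("superfamily_A", 310.0),
    DomainHit("subfamily_A1", 250.0, (240.0, 265.0, 300.0)),
    DomainHit("subfamily_A2", 180.0, (220.0, 245.0)),
]

for hit in rank_hits(hits):
    label = "specific" if is_specific(hit) else "non-specific"
    print(f"{hit.model}\t{hit.score}\t{label}")
```

In this toy input the query falls inside the representative score range of only one subfamily, so that subfamily is listed first even though the superfamily model has a higher raw score.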

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000046-16
Application #: 7735067
Support Year: 16
Fiscal Year: 2008
Total Cost: $5,603,975

Name: National Library of Medicine
Country: United States
Marchler-Bauer, Aron; Anderson, John B; Chitsaz, Farideh et al. (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res 37:D205-10
Tyagi, Manoj; Shoemaker, Benjamin A; Bryant, Stephen H et al. (2009) Exploring functional roles of multibinding protein interfaces. Protein Sci 18:1674-83
Thompson, Kenneth Evan; Wang, Yanli; Madej, Tom et al. (2009) Improving protein structure similarity searches using domain boundaries based on conserved sequence information. BMC Struct Biol 9:33
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 37:D5-15
Fong, Jessica H; Geer, Lewis Y; Panchenko, Anna R et al. (2007) Modeling the evolution of protein domain architectures using maximum parsimony. J Mol Biol 366:307-15
Marchler-Bauer, Aron; Anderson, John B; Derbyshire, Myra K et al. (2007) CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res 35:D237-40
Madej, Thomas; Panchenko, Anna R; Chen, Jie et al. (2007) Protein homologous cores and loops: important clues to evolutionary relationships between structurally similar proteins. BMC Struct Biol 7:23
Wang, Yanli; Addess, Kenneth J; Chen, Jie et al. (2007) MMDB: annotating protein sequences with Entrez's 3D-structure database. Nucleic Acids Res 35:D298-300
Kann, Maricel G; Sheetlin, Sergey L; Park, Yonil et al. (2007) The identification of complete domains within protein sequences using accurate E-values for semi-global alignment. Nucleic Acids Res 35:4678-85
Chakrabarti, Saikat; Bryant, Stephen H; Panchenko, Anna R (2007) Functional specificity lies within the properties and evolutionary changes of amino acids. J Mol Biol 373:801-10
