We have developed computer methods to compare a protein's sequence with a library of "folds" from the structural database. The sequence is "threaded" through alternative structures, and the most compatible structures are identified by energy calculations using contact potentials. Because they detect structural similarity directly, threading methods can identify very distant evolutionary relationships that may be undetectable by sequence comparison. Research has focused on testing the core-element threading method in blind predictions and control experiments, and on algorithmic improvements to increase sensitivity. Control experiments using known structures identified thresholds for successful fold recognition and accurate modeling: the similar "core" substructure must comprise 60% or more of the protein and must superpose to a residual of 2.5 Angstroms or less, such that a large fraction of contacts is preserved. Analysis of predictions for the 1996 CASP2 workshop (Critical Assessment of Structure Prediction) confirmed this conclusion. Structural similarity can be less extensive in some cases of distant relationship, however, and several improvements to increase sensitivity have been considered. New definitions of the "core" of database structures, based on the regions superimposable in homologs with known structures, have been shown to reduce false negatives in threading predictions. Combining contact potentials with sequence-motif scores was also shown to increase sensitivity in difficult recognition problems. Use of rigorous p-value calculations was shown to reduce false positives. With these improvements, fold recognition may be expected to reliably detect a greater proportion of distant evolutionary relationships.
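The scoring ideas in this paragraph can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the dictionary form of the contact potential, and the shuffled-sequence p-value are all assumptions made for illustration (the shuffle test is a simple stand-in for the rigorous p-value calculation mentioned above). Only the 60% and 2.5 Angstrom thresholds come from the text.

```python
import random


def threading_energy(sequence, contacts, potential):
    """Sum pairwise contact energies for a sequence placed on a fixed fold.

    contacts:  (i, j) position pairs that touch in the template structure.
    potential: hypothetical table mapping a sorted residue pair to an energy;
               pairs missing from the table contribute 0.0.
    """
    total = 0.0
    for i, j in contacts:
        pair = tuple(sorted((sequence[i], sequence[j])))
        total += potential.get(pair, 0.0)
    return total


def empirical_p_value(sequence, contacts, potential, n_shuffles=1000, seed=0):
    """Fraction of shuffled sequences scoring at least as well as the real one.

    Assumption: composition-preserving shuffles serve as the null model.
    Lower energy means better compatibility.
    """
    rng = random.Random(seed)
    observed = threading_energy(sequence, contacts, potential)
    residues = list(sequence)
    as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if threading_energy(residues, contacts, potential) <= observed:
            as_good += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (as_good + 1) / (n_shuffles + 1)


def meets_core_thresholds(core_length, protein_length, residual_angstroms):
    """Apply the thresholds quoted in the text: core >= 60%, residual <= 2.5 A."""
    core_fraction = core_length / protein_length
    return core_fraction >= 0.60 and residual_angstroms <= 2.5
```

For example, `meets_core_thresholds(60, 100, 2.5)` is true, while a core covering 59 of 100 residues, or one superposing to a residual of 3.0 Angstroms, fails the test.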
The improved sensitivity was demonstrated at the 1998 CASP3 workshop, where the NCBI team was awarded "first place" in fold recognition among more than 90 international groups entering the competition. The threading methods developed in this project are now being applied to the construction of a conserved domain database (CDD). Seed domain alignments, derived from sequence comparison, are mapped onto known 3D structures and compared with 3D structure alignments to define a core-structure alignment for a sample of representative domains. These alignments are validated by threading calculations, and additional representative sequences detected by RPS-BLAST scanning are merged into the alignment by threading. A newly developed algorithm for sequence vs. PSSM (position-specific score matrix) alignment using core-element "blocks" has greatly accelerated these calculations and made core-element alignment a practical tool for constructing curated protein domain alignments. CDD alignments serve as a protein classification system for public information retrieval services. Domains with conserved structure and function are easily identified, and visualization of the resulting sequence/structure alignments provides a detailed annotation of structure-function relationships. Work this year has focused on construction of the CDTree alignment-hierarchy editing system. Versions 1 and 2 were deployed in turn to the CDD curation team, and release to the public is anticipated next year.
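One way to picture block-based sequence vs. PSSM alignment is the dynamic-programming sketch below. It is an illustration under stated assumptions, not the algorithm developed in this project: blocks are treated as ungapped runs of PSSM columns that must be placed on the sequence in order and without overlap, the loops between blocks are unpenalized, and residues absent from a column receive a hypothetical default score of -1.0. The names `block_score` and `align_blocks` are invented for this example.

```python
def block_score(sequence, pssm, block, start):
    """Score one ungapped block placed at sequence position `start`.

    pssm:  list of columns, each a dict mapping residue -> score.
    block: (first, last) half-open range of PSSM columns.
    """
    first, last = block
    return sum(
        pssm[col].get(sequence[start + offset], -1.0)  # assumed default score
        for offset, col in enumerate(range(first, last))
    )


def align_blocks(sequence, pssm, blocks):
    """Place non-empty blocks in order on the sequence, maximizing total score.

    Returns (best_score, start position of each block).
    """
    n = len(sequence)
    neg_inf = float("-inf")
    # prefix[e]: best score of the blocks placed so far, all ending by position e.
    prefix = [0.0] * (n + 1)
    back = []
    for first, last in blocks:
        length = last - first
        new_prefix = [neg_inf] * (n + 1)
        best_start = [None] * (n + 1)
        for end in range(length, n + 1):
            start = end - length
            # Carry forward the best placement that ended earlier.
            new_prefix[end] = new_prefix[end - 1]
            best_start[end] = best_start[end - 1]
            if prefix[start] > neg_inf:
                candidate = prefix[start] + block_score(
                    sequence, pssm, (first, last), start
                )
                if candidate > new_prefix[end]:
                    new_prefix[end] = candidate
                    best_start[end] = start
        prefix = new_prefix
        back.append(best_start)
    if prefix[n] == neg_inf:
        raise ValueError("sequence is too short to hold all blocks")
    # Trace back from the end of the sequence to recover block positions.
    starts = []
    end = n
    for best_start in reversed(back):
        start = best_start[end]
        starts.append(start)
        end = start
    starts.reverse()
    return prefix[n], starts
```

Because each block is scored without internal gaps, the search runs over block start positions only, so the work grows with the number of blocks times the sequence length. That restriction is one plausible reason a block formulation can be fast, though the source does not describe the actual source of the speedup.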
Chakrabarti, Saikat; Lanczycki, Christopher J; Panchenko, Anna R et al. (2006) Refining multiple sequence alignments with conserved core regions. Nucleic Acids Res 34:2598-606
Marchler-Bauer, Aron; Anderson, John B; Cherukuri, Praveen F et al. (2005) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 33:D192-6
Wheeler, David L; Barrett, Tanya; Benson, Dennis A et al. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33:D39-45
Kann, Maricel G; Thiessen, Paul A; Panchenko, Anna R et al. (2005) A structure-based method for protein sequence alignment. Bioinformatics 21:1451-6
Panchenko, Anna R; Kondrashov, Fyodor; Bryant, Stephen (2004) Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci 13:884-92
Panchenko, Anna R (2003) Finding weak similarities between proteins by sequence profile comparison. Nucleic Acids Res 31:683-9
Marchler-Bauer, Aron; Anderson, John B; DeWeese-Scott, Carol et al. (2003) CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res 31:383-7
Panchenko, Anna R; Bryant, Stephen H (2002) A comparison of position-specific score matrices based on sequence and structure alignments. Protein Sci 11:361-70
Geer, Lewis Y; Domrachev, Michael; Lipman, David J et al. (2002) CDART: protein homology by domain architecture. Genome Res 12:1619-23
Marchler-Bauer, Aron; Panchenko, Anna R; Ariel, Naomi et al. (2002) Comparison of sequence and structure alignments for protein domains. Proteins 48:439-46