This project covers two interrelated software development efforts. Both are intended to support curators and users of the Conserved Domain Database (CDD), as described separately in project LM000161-04. The first effort entails the development and testing of algorithms intended to produce accurate alignments for diverse protein sequences. The second effort involves development of interactive, user-friendly tools for aligning large numbers of homologous but diverse protein sequence fragments, and for identifying the subfamily structure within such a protein domain superfamily. Development of structure-based alignment algorithms builds upon our group's earlier work on protein threading, a set of structure prediction methods based on the detection of distant homologous relationships. These methods "thread" a protein sequence through a structural template, scoring alternative alignments by energy calculations that use contact potentials and a sequence profile derived from the protein family of the template. The success of these methods was demonstrated at the 1998 CASP3 workshop, where the NCBI team was awarded "first place" in structure prediction by fold recognition, among over 90 international groups entering the competition. To adapt these methods to the high-throughput alignment needed by CDD curators, we have developed more efficient versions of the block-alignment algorithm used in threading. Earlier work has shown that this method produces alignments accurate enough for identification of conserved functional sites, and that information loss relative to the original threading method is minimal. An automated multiple-alignment refinement algorithm, which iteratively applies structure-based alignment to one row at a time, has been thoroughly tested. Its performance suggests that it will benefit the CDD curation effort, and the algorithm is being implemented in Cn3D, a major component of the CDD curation software.
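The block-alignment idea mentioned above can be illustrated with a small dynamic-programming sketch. This is not the Cn3D or threading implementation; it is a minimal example under stated assumptions: each conserved block is an ungapped stretch of profile columns, blocks must be placed on the query in order without overlapping, loops between blocks may have any length, and the score comes from the sequence profile only (contact potentials are left out). The names `block_score`, `align_blocks`, and the `default` mismatch score are hypothetical.

```python
from typing import Dict, List, Tuple

Column = Dict[str, float]   # residue -> profile score for one aligned column
Block = List[Column]        # an ungapped run of columns (assumed non-empty)


def block_score(seq: str, block: Block, start: int, default: float = -1.0) -> float:
    """Score one block placed on seq at `start`; unseen residues get `default`."""
    return sum(col.get(seq[start + j], default) for j, col in enumerate(block))


def align_blocks(seq: str, blocks: List[Block]) -> Tuple[float, List[int]]:
    """Place all blocks in order, without overlap, maximizing the summed score.

    Returns (best total score, start position of each block).
    """
    n = len(seq)
    neg_inf = float("-inf")
    # best[k][e]: best score for the first k blocks placed inside seq[:e]
    best = [[neg_inf] * (n + 1) for _ in range(len(blocks) + 1)]
    # took[k][e]: True if, in that optimum, block k ends exactly at position e
    took = [[False] * (n + 1) for _ in range(len(blocks) + 1)]
    best[0] = [0.0] * (n + 1)

    for k, block in enumerate(blocks, start=1):
        width = len(block)
        for e in range(1, n + 1):
            best[k][e] = best[k][e - 1]          # block k ends before e
            start = e - width
            if start >= 0 and best[k - 1][start] > neg_inf:
                cand = best[k - 1][start] + block_score(seq, block, start)
                if cand > best[k][e]:            # block k ends exactly at e
                    best[k][e] = cand
                    took[k][e] = True

    if best[len(blocks)][n] == neg_inf:
        raise ValueError("sequence is too short to hold all blocks")

    # Trace back from the end of the sequence to recover block start positions.
    starts: List[int] = []
    e = n
    for k in range(len(blocks), 0, -1):
        while not took[k][e]:
            e -= 1
        e -= len(blocks[k - 1])
        starts.append(e)
    starts.reverse()
    return best[len(blocks)][n], starts
```

With a toy query `"AAKLMAA"` and two blocks, `[{"K": 2.0}, {"L": 2.0}]` and `[{"M": 3.0}]`, the sketch places the first block at position 2 and the second at position 4. Forbidding gaps inside blocks keeps the search space small, which is one plausible reason such a scheme suits high-throughput use.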
Cn3D already contains the basic version of the block alignment tool. This software is in daily use by the CDD curator team, and a version with this extended alignment functionality has been widely distributed. Further work has focused on development of the CDTree alignment hierarchy editing system, which is also in daily use by the CDD curator team. This software implements a suite of tools for molecular evolutionary analysis of protein families in an interactive package. It supports generation of phylogenetic sequence trees using several algorithms from the literature, linked to displays of organism taxonomy trees and summaries of overall protein domain architecture. The software also supports an integrated "update" procedure that automatically searches the daily-updated sequence and structure databases for new members of each family, selecting non-redundant representatives of new similarity and/or taxonomy groups. The CDTree subfamily hierarchy editor communicates seamlessly with the Cn3D alignment editor, allowing curator-users to easily detect and correct "outliers" caused by alignment or sequence errors. While designed for the needs of CDD curators, CDTree may also be used as a simple viewer, showing in intuitive graphical displays the sequence, taxonomic and functional diversity within a CDD family hierarchy. This year CDTree has been exhaustively tested and public release is imminent, together with a new, compatible version of Cn3D. CDD's web services are ready to support visualization of phylogenetic sequence trees, which provide evidence for domain hierarchies as defined by CDD curators. CDD's web services are also ready to support download and analysis of CD hierarchies, using CDTree as a helper application on the user's computer. CDTree's integrated "update" procedure provides a novel interface to NCBI's PSI-BLAST program, which lets users search the protein database with customized position-specific score matrices, after providing an opportunity to analyze and refine the intermediate family alignment models with a variety of tools. We are working with the BLAST group to provide direct CDTree launch capability from BLAST results pages, as generated by the BLAST web services.
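The selection of non-redundant representatives during the "update" procedure could, in principle, follow a simple greedy filter. The abstract does not say how CDTree does this, so the sketch below is only an assumed illustration: candidates are pre-aligned to the family (equal-length strings with `-` for gaps), and a candidate is kept only if its fractional identity to every sequence already kept falls below a cutoff. The function names and the `max_identity` parameter are hypothetical.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def select_representatives(
    existing: List[str], candidates: List[str], max_identity: float = 0.9
) -> List[str]:
    """Greedily keep candidates that are not too similar to anything kept so far.

    `existing` holds rows already in the family; only newly accepted
    candidates are returned, in the order they were considered.
    """
    kept = list(existing)
    accepted: List[str] = []
    for cand in candidates:
        if all(percent_identity(cand, row) < max_identity for row in kept):
            kept.append(cand)
            accepted.append(cand)
    return accepted
```

A greedy pass depends on candidate order, so in practice one would probably sort candidates first, for example by search score; that ordering is likewise an assumption here.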

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000045-13
Application #: 7148028
Study Section: (CBB)
Support Year: 13
Fiscal Year: 2005
Name: National Library of Medicine
Country: United States
Chakrabarti, Saikat; Lanczycki, Christopher J; Panchenko, Anna R et al. (2006) Refining multiple sequence alignments with conserved core regions. Nucleic Acids Res 34:2598-606
Marchler-Bauer, Aron; Anderson, John B; Cherukuri, Praveen F et al. (2005) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 33:D192-6
Wheeler, David L; Barrett, Tanya; Benson, Dennis A et al. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33:D39-45
Kann, Maricel G; Thiessen, Paul A; Panchenko, Anna R et al. (2005) A structure-based method for protein sequence alignment. Bioinformatics 21:1451-6
Panchenko, Anna R; Kondrashov, Fyodor; Bryant, Stephen (2004) Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci 13:884-92
Panchenko, Anna R (2003) Finding weak similarities between proteins by sequence profile comparison. Nucleic Acids Res 31:683-9
Marchler-Bauer, Aron; Anderson, John B; DeWeese-Scott, Carol et al. (2003) CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res 31:383-7
Marchler-Bauer, Aron; Panchenko, Anna R; Ariel, Naomi et al. (2002) Comparison of sequence and structure alignments for protein domains. Proteins 48:439-46
Marchler-Bauer, Aron; Panchenko, Anna R; Shoemaker, Benjamin A et al. (2002) CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 30:281-3
Panchenko, Anna R; Bryant, Stephen H (2002) A comparison of position-specific score matrices based on sequence and structure alignments. Protein Sci 11:361-70

Showing the most recent 10 out of 15 publications