With the Conserved Domain Database (CDD) resource we are producing a database of expert-curated protein domain alignments. Such alignment models describe the sequence and 3D-structure conservation within protein families, facilitating the annotation of conserved functional features. The alignment models also describe the variability present in a domain family, facilitating the depiction of its functional diversity.

This project covers the curation of CDD alignments by human experts. The role of the CDD curators is multifaceted. First, they must survey the relevant scientific literature to produce concise summaries of the known functions of each domain family, to study existing subfamily classifications, and to choose citations useful to users of NCBI's web-based classification resources. Curators must also examine the results of automated sequence and structure comparison to infer the location of conserved core blocks, an iterative process that requires judgment in eliminating incomplete or erroneous sequence and structure data. Finally, curators must identify apparent orthology groups, based on the consensus of results from alternative molecular evolution and clustering methods. The curator group has so far produced about 1500 curated CDD families.

Both curated and un-curated multiple sequence alignments are used to generate position-specific scoring matrices (PSSMs), which may in turn be used in NCBI's web-based protein classification resources. A number of NCBI information services use CDD to identify conserved domains within protein sequences. Links to CDD are made by default from, for example: 1) NCBI's protein-BLAST resource, www.ncbi.nlm.nih.gov/BLAST/; 2) proteins in NCBI's Entrez browser, www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein; 3) records in NCBI's HomoloGene system, www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=homologene. Further information about CDD and these search services is available at www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.

Curated domain models summarize the known functions of family members, using relevant citations from PubMed when possible, and may link to resources on the NCBI Bookshelf for further information. They also provide site-specific functional annotation, via sequence and structure alignments and via pre-recorded evidence-based features, such as interaction or active sites.

The CDD alignment curation project differs from comparable efforts, upon which it builds, in two fundamental ways: (i) 3D-structure information is used in a quantitative way, whenever possible, to guide the alignments, and (ii) an explicit hierarchy of families and subfamilies, related by descent from a common ancestor, reflects the evolutionary history of each domain super-family.

When at least one 3D structure is known within a domain family, this information is used to define the conserved homologous core structure, a set of un-gapped blocks that must be identified in all representative sequences included in the alignment. Representative sequences are aligned to this core structure using structure-informed alignment algorithms or, when multiple 3D structures are known, alignments obtained from structure superposition. These procedures ensure high alignment accuracy, as needed for accurate transfer of annotation to new family members identified by searching. Representative sequences are picked from a set of "preferred taxonomy nodes", so that the domain alignments represent the taxonomic span of a family, which in turn indicates its apparent evolutionary age.

Explicit hierarchies identify major gene duplication events in the molecular evolution of each family. Our basic strategy is to use domain-sequence clustering methods together with known domain architecture and phylogeny to identify what appear to be ancient orthology groups. These define explicitly annotated "children" of the overall "parent" alignment, and in turn provide more specific functional annotation.

The CDD project employs a high level of automation, to produce structure-based alignments, to identify candidate orthology groups, to update CDD alignments with new sequences and structures, and to "publish" the results to web servers. These algorithms and associated software required are d
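
As an illustration of two ideas in the abstract above, un-gapped core blocks and PSSMs derived from a multiple alignment, the following is a minimal Python sketch. It is not the CDD pipeline. The function names `ungapped_blocks` and `simple_pssm`, the toy alignment, the uniform background frequency of 1/20, and the add-one pseudocount are all assumptions chosen for brevity. Production PSSMs use sequence weighting and data-derived background frequencies and pseudocounts.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ungapped_blocks(alignment, min_len=2):
    """Return half-open (start, end) column ranges in which no row has a gap.

    `alignment` is a list of equal-length strings using '-' for gaps.
    Hypothetical helper: a stand-in for the curated "core block" idea.
    """
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    blocks, start = [], None
    # Iterate one column past the end so a trailing block is closed too.
    for col in range(ncols + 1):
        gap_free = col < ncols and all(row[col] != "-" for row in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_len:
                blocks.append((start, col))
            start = None
    return blocks


def simple_pssm(alignment, columns, pseudocount=1.0):
    """Return one dict of log2-odds scores per requested column.

    Assumes a uniform background (1/20) and a flat pseudocount;
    both are simplifications made for this sketch.
    """
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for col in columns:
        counts = Counter(row[col] for row in alignment)
        observed = sum(counts[aa] for aa in AMINO_ACIDS)
        total = observed + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


# Toy alignment (invented data, for illustration only).
toy = [
    "MKV-LAGT",
    "MKVQLAGS",
    "MRV-LA-T",
]

blocks = ungapped_blocks(toy)
core_columns = [c for start, end in blocks for c in range(start, end)]
pssm = simple_pssm(toy, core_columns)
```

Restricting the matrix to gap-free columns mirrors the requirement that core blocks be present in every representative sequence. In this sketch, a residue seen in every row of a column scores above zero and an unobserved residue scores below zero.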

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000161-04
Application #
7148047
Study Section
(CBB)
Project Start
Project End
Budget Start
Budget End
Support Year
4
Fiscal Year
2005
Total Cost
Indirect Cost
Name
National Library of Medicine
Department
Type
DUNS #
City
State
Country
United States
Zip Code
Fong, Jessica H; Geer, Lewis Y; Panchenko, Anna R et al. (2007) Modeling the evolution of protein domain architectures using maximum parsimony. J Mol Biol 366:307-15
Marchler-Bauer, Aron; Anderson, John B; Cherukuri, Praveen F et al. (2005) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 33:D192-6
Wheeler, David L; Barrett, Tanya; Benson, Dennis A et al. (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33:D39-45
Marchler-Bauer, Aron; Bryant, Stephen H (2004) CD-Search: protein domain annotations on the fly. Nucleic Acids Res 32:W327-31
Marchler-Bauer, Aron; Anderson, John B; DeWeese-Scott, Carol et al. (2003) CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res 31:383-7
Marchler-Bauer, Aron; Panchenko, Anna R; Ariel, Naomi et al. (2002) Comparison of sequence and structure alignments for protein domains. Proteins 48:439-46
Marchler-Bauer, Aron; Panchenko, Anna R; Shoemaker, Benjamin A et al. (2002) CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 30:281-3
Panchenko, Anna R; Bryant, Stephen H (2002) A comparison of position-specific score matrices based on sequence and structure alignments. Protein Sci 11:361-70
Geer, Lewis Y; Domrachev, Michael; Lipman, David J et al. (2002) CDART: protein homology by domain architecture. Genome Res 12:1619-23