Protein three-dimensional structures are drawn from the Protein Data Bank (PDB), an international database collaboration supported in part by the NIH. PDB records are processed at NCBI to produce Molecular Modeling Database (MMDB) records with precise definitions of the component biological macromolecules and chemicals, and of their interactions as indicated by atomic contacts in the three-dimensional structure. Protein structure records are compared to NCBI protein sequence records using the Basic Local Alignment Search Tool algorithm (BLAST) and to one another using the Vector Alignment Search Tool structure-comparison algorithm (VAST). Protein sequences in the NCBI collection are also compared to protein family records in the Conserved Domain Database (CDD) using the Reverse Position-Specific BLAST algorithm (RPSB). These automated comparison methods provide the cross-references needed to link protein and gene sequences in NCBI's extensive collection to the biological function annotation provided by protein structures.
Informatics projects were needed this year to address the ongoing "remediation" undertaken by PDB, which has more than once modified 100% of the more than 85,000 files in its collection. Fortunately, "remediation" of the bonded-atom connectivity (chemical graphs) of component molecules has become less frequent. Additional, potentially useful information is also now provided, such as "SPLIT" records indicating when a given crystal structure has been divided into two or more PDB files, and "Biological Assembly" information indicating a subset of biologically important interactions. Unfortunately, the additional information is encoded in "REMARK" text without documented formats, so developing reliable text-parsing algorithms requires considerable effort. Nevertheless, correction of PDB "SPLIT" files and display of "Biological Assembly" information on MMDB "Document Summary" and "Structure Summary" pages have progressed and were publicly released last year.
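To illustrate the kind of text parsing that "Biological Assembly" information in "REMARK" text calls for, here is a minimal Python sketch. It assumes the commonly seen "REMARK 350" layout ("BIOMOLECULE:", "APPLY THE FOLLOWING TO CHAINS:", "BIOMT1" lines); that layout, the function name parse_remark_350, and the returned dictionary shape are assumptions made for illustration and are not the parser used for MMDB.

```python
import re

def parse_remark_350(lines):
    """Collect chain IDs and operator counts per biological assembly.

    Assumes (hypothetically) that each assembly starts with a
    'BIOMOLECULE: n' line and that every transformation contributes
    exactly one 'BIOMT1' row.
    """
    assemblies = {}
    current = None
    for line in lines:
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()  # text after the 'REMARK 350' tag
        match = re.match(r"BIOMOLECULE:\s*(\d+)", body)
        if match:
            current = int(match.group(1))
            assemblies.setdefault(current, {"chains": [], "n_operators": 0})
            continue
        if current is None:
            continue  # ignore content that precedes any assembly header
        match = re.match(
            r"(?:APPLY THE FOLLOWING TO CHAINS:|AND CHAINS:)\s*(.*)", body
        )
        if match:
            chains = [c.strip() for c in match.group(1).split(",") if c.strip()]
            assemblies[current]["chains"].extend(chains)
            continue
        if body.startswith("BIOMT1"):
            # First row of a 3x4 transformation; count one operator.
            assemblies[current]["n_operators"] += 1
    return assemblies
```

Because the text notes that these formats are undocumented, a real parser would likely need many more fallbacks than this sketch shows.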
Research to identify molecular interactions not included in the PDB "Biological Assembly" has continued. This includes identification of contact thresholds for biologically relevant protein-ligand complexes, such as heme in hemoglobin, and of interactions observed among related protein structures but not mentioned in the PDB files of those structures. Research is in progress to modify the VAST algorithm to accurately align and superimpose the oligomeric 3D structures of "Biological Assemblies."
The NCBI Conserved Domain Database (CDD) is derived in part from comprehensive protein sequence alignment collections prepared automatically by others. These include, for example, the Pfam collection prepared at the Wellcome Trust Sanger Institute and the Protein Clusters collection prepared at NCBI/IEB. More important contributions to CDD are the expert-curated protein family alignments prepared by staff of the CDD project. Very accurate protein family alignments consistent with known three-dimensional structures and structural superpositions are prepared using algorithms within the "See in Three Dimensions" program (Cn3D). Conserved subfamilies consistent with evolutionary evidence are derived using phylogenetic tree algorithms and graphics within the "Conserved Domain Tree" program (CDTree). Curators save into CDD records the phylogenetic trees identifying ancient conserved subfamilies, together with the biological function annotation derived from interactions observed in three-dimensional structures within each subfamily and/or from other observations, such as subfamily-specific experimental studies reported in the literature.
CDD informatics projects have continued this year. An algorithm that automates refinement of CDD subfamily alignments by rapidly performing multiple realignments of member sequences has continued to prove useful in reducing curator effort.
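As an aside on the contact thresholds mentioned above for protein-ligand complexes, the basic idea can be sketched in Python as counting atom pairs within a distance cutoff. The 4.0 angstrom cutoff, the minimum of 5 contacts, and the function names are illustrative assumptions only; the thresholds actually identified in this research are not stated in the text.

```python
import math

def count_contacts(protein_atoms, ligand_atoms, cutoff=4.0):
    """Count protein-ligand atom pairs no farther apart than `cutoff`.

    Atoms are (x, y, z) coordinate tuples in angstroms.
    The default cutoff is an assumed value, not one from the project.
    """
    n_contacts = 0
    for p in protein_atoms:
        for l in ligand_atoms:
            if math.dist(p, l) <= cutoff:
                n_contacts += 1
    return n_contacts

def is_candidate_interaction(protein_atoms, ligand_atoms,
                             cutoff=4.0, min_contacts=5):
    """Flag a ligand as a candidate interaction partner when it makes
    at least `min_contacts` contacts (hypothetical threshold)."""
    return count_contacts(protein_atoms, ligand_atoms, cutoff) >= min_contacts
```

The all-pairs loop is quadratic in the number of atoms; a spatial grid or k-d tree would be the usual choice for full structures.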
A batch CD-search service supporting easy retrieval of CDD alignments and functional annotations for large groups of sequences has now been released and is in wide use. A research project on automated identification of ancient conserved multi-domain architectures has also continued and appears successful. The goal is to support efficient construction of ancient conserved multi-domain CDD records based on previously curated alignments of component domains. This will allow curators to provide multi-domain-specific functional annotation without needing to edit already-accurate alignments. Another research project, continued this year, aims to automate identification of ancient conserved subfamilies where accurate biological function annotation is likely, given known three-dimensional structures and/or literature citations for subfamily member sequences. If successful, this procedure will reduce the curator time required to browse the often-large phylogenetic trees produced by CDTree, supporting efficient identification of subfamilies where functional annotation is most feasible and worthwhile.
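The notion of a multi-domain architecture discussed above can be made concrete with a small Python sketch: order each sequence's domain hits from N- to C-terminus and count how many sequences share the same ordered combination. This simple recurrence count is a stand-in chosen for illustration, not the project's method, which would also need evolutionary evidence to judge whether an architecture is ancient. The hit tuples, domain names, and function names are hypothetical.

```python
from collections import Counter

def architecture(hits):
    """Return domain IDs ordered by start position.

    `hits` is a list of (start, end, domain_id) tuples for one sequence.
    """
    return tuple(domain_id for _, _, domain_id in sorted(hits))

def recurring_architectures(hits_by_sequence, min_sequences=2):
    """Count multi-domain architectures shared by several sequences.

    Single-domain sequences are skipped; `min_sequences` is an
    assumed cutoff for calling an architecture recurrent.
    """
    counts = Counter(
        architecture(hits)
        for hits in hits_by_sequence.values()
        if len(hits) > 1
    )
    return {arch: n for arch, n in counts.items() if n >= min_sequences}
```

Working from hits against existing domain models mirrors the stated goal of building multi-domain records without editing the component alignments.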

Support Year: 21
Fiscal Year: 2013
Total Cost: $4,170,899
Name: National Library of Medicine
Madej, Thomas; Lanczycki, Christopher J; Zhang, Dachuan et al. (2014) MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 42:D297-303
Pan, Yongmei; Cheng, Tiejun; Wang, Yanli et al. (2014) Pathway analysis for drug repositioning based on public database mining. J Chem Inf Model 54:407-18
Marchler-Bauer, Aron; Zheng, Chanjuan; Chitsaz, Farideh et al. (2013) CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 41:D348-52
Kim, Sunghwan; Bolton, Evan E; Bryant, Stephen H (2013) PubChem3D: conformer ensemble accuracy. J Cheminform 5:1
Pan, Yongmei; Wang, Yanli; Bryant, Stephen H (2013) Pharmacophore and 3D-QSAR characterization of 6-arylquinazolin-4-amines as Cdc2-like kinase 4 (Clk4) and dual specificity tyrosine-phosphorylation-regulated kinase 1A (Dyrk1A) inhibitors. J Chem Inf Model 53:938-47
Tyagi, Manoj; Shoemaker, Benjamin A; Bryant, Stephen H et al. (2009) Exploring functional roles of multibinding protein interfaces. Protein Sci 18:1674-83
Thompson, Kenneth Evan; Wang, Yanli; Madej, Tom et al. (2009) Improving protein structure similarity searches using domain boundaries based on conserved sequence information. BMC Struct Biol 9:33
Marchler-Bauer, Aron; Anderson, John B; Chitsaz, Farideh et al. (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res 37:D205-10
Tang, Ke; Pugalenthi, Ganesan; Suganthan, P N et al. (2009) Prediction of functionally important sites from protein sequences using sparse kernel least squares classifiers. Biochem Biophys Res Commun 384:155-9
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2009) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 37:D5-15
