Protein three-dimensional structures are drawn from the Protein Data Bank (PDB), an international database collaboration supported in part by the NIH. PDB records are processed at NCBI to provide Molecular Modeling Database (MMDB) records with precise definitions of the component biological macromolecules and chemicals, and of their interactions as indicated by atomic contacts in the three-dimensional structure. Protein structure records are compared to NCBI protein sequence records using the Basic Local Alignment Search Tool algorithm (BLAST) and compared to one another by the Vector Alignment Search Tool structure-comparison algorithm (VAST). Protein sequences in the NCBI collection are also compared to protein family records in the Conserved Domain Database (CDD) using the Reverse Position-Specific BLAST algorithm (RPSB). These automated comparison methods provide the cross-references needed to link protein and gene sequences in NCBI's extensive collection to the biological function annotation provided by protein structures.
Informatics projects were needed this year to address further "remediation" undertaken by PDB. PDB once again modified all of the more than 50,000 files in its collection, providing the bonded-atom connectivity (chemical graphs) of component molecules in a revised format. These files could be used only after modifying NCBI's algorithms to process "remediated" PDB files correctly. Other informatics projects improved molecular graphics and annotation in the NCBI Entrez-Structure "Document Summary" and "Structure Summary" displays.
Research to identify conserved molecular interactions observed among related protein structures has continued. The goal is to identify biologically relevant and informative interactions, none of which are explicitly annotated by PDB. The Inferred Biomolecular Interactions Server (IBIS) is now available at www.ncbi.nlm.nih.gov/Structure/ibis/ibis.cgi.
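To make the idea of interactions "indicated by atomic contacts" concrete, the following minimal Python sketch counts atom pairs that lie within a distance cutoff between two molecules and reports the residues involved on each side (a footprint). It is an illustration only, not NCBI's MMDB or IBIS procedure: the function name `contact_footprint`, the `(residue_number, (x, y, z))` input layout, and the 4.0 Å default cutoff are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def contact_footprint(chain_a, chain_b, cutoff=4.0):
    """Count atomic contacts between two molecules.

    Each chain is a list of (residue_number, (x, y, z)) tuples, one per atom.
    This layout and the cutoff (in angstroms) are assumptions for the sketch.
    Returns (contact_count, footprint_on_a, footprint_on_b), where each
    footprint is the sorted list of residue numbers with at least one contact.
    """
    count = 0
    foot_a, foot_b = set(), set()
    for res_a, xyz_a in chain_a:
        for res_b, xyz_b in chain_b:
            if dist(xyz_a, xyz_b) <= cutoff:
                count += 1
                foot_a.add(res_a)
                foot_b.add(res_b)
    return count, sorted(foot_a), sorted(foot_b)


# Toy coordinates, invented for illustration only.
protein = [(10, (0.0, 0.0, 0.0)), (11, (10.0, 0.0, 0.0))]
ligand = [(50, (3.0, 0.0, 0.0)), (51, (0.0, 3.5, 0.0)), (52, (20.0, 0.0, 0.0))]
print(contact_footprint(protein, ligand))
```

The all-pairs loop is quadratic in the number of atoms. That is acceptable for a sketch, but a production system would more plausibly use a spatial index.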
Further research aims to explicitly record observed interactions in MMDB as atomic contact counts and footprints on biological macromolecules. The goal is to rank informative interactions at the top of lists shown in Entrez-Structure displays, with automatic flagging of evolutionarily conserved interactions identified by IBIS.
The NCBI Conserved Domain Database (CDD) is derived in part from comprehensive protein sequence alignment collections prepared automatically by others. These include, for example, the Pfam collection prepared at the Wellcome Trust Sanger Institute and the Protein Clusters collection prepared at NCBI/IEB. More important contributions to CDD are the expert-curated protein family alignments prepared by staff of the project. Very accurate protein family alignments consistent with known three-dimensional structures and structural superpositions are prepared using algorithms within the "See in Three Dimensions" program (Cn3D). Conserved subfamilies consistent with evolutionary evidence are derived using phylogenetic tree algorithms and graphics within the "Conserved Domain Tree" program (CDTree). Curators save into CDD records the phylogenetic trees identifying ancient conserved subfamilies, together with the biological function annotation derived from interactions observed in three-dimensional structures within each subfamily and/or from other observations, such as subfamily-specific experimental studies reported in the literature.
A new informatics project undertaken this year aimed to automate refinement of CDD subfamily alignments by rapidly performing multiple realignments of member sequences. This proved successful in reducing curator effort and is now in daily use. Another new informatics project is a user-requested batch CD-search service, which allows users to easily retrieve CDD alignments and functional annotation for large groups of sequences. The service appears successful and release is expected soon.
A research project on automated identification of ancient conserved multi-domain architectures was initiated this year and appears successful. The goal is to support efficient construction of ancient conserved multi-domain CDD records based on previously curated alignments of component domains. This will allow curators to provide multi-domain-specific functional annotation without needing to edit already-accurate alignments.
Another research project initiated this year aims to automate identification of ancient conserved subfamilies where accurate biological function annotation is likely, given known three-dimensional structures and/or literature citations for subfamily member sequences. If successful, this procedure will reduce the curator time required to browse the often-large phylogenetic trees produced by CDTree, supporting efficient identification of subfamilies where functional annotation is most feasible and worthwhile.
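As a rough illustration of what identifying recurring multi-domain architectures could involve, the Python sketch below groups proteins by the ordered list of domains found on them and keeps the arrangements shared by several proteins. It is a simplification under stated assumptions, not the project's method: the function name `recurring_architectures`, the `{protein_id: [(start, domain_id), ...]}` input layout, and the `min_proteins` threshold are hypothetical. Simple recurrence here stands in for the evolutionary evidence that real curation would require.

```python
from collections import defaultdict


def recurring_architectures(domain_hits, min_proteins=2):
    """Group proteins by their ordered domain architecture.

    domain_hits maps a protein id to a list of (start_position, domain_id)
    hits (an assumed layout for this sketch). An architecture is the tuple
    of domain ids in sequence order. Only multi-domain architectures found
    in at least `min_proteins` proteins are returned.
    """
    members = defaultdict(list)
    for protein, hits in domain_hits.items():
        # Sort hits by start position to get the N-to-C terminal domain order.
        architecture = tuple(domain for _, domain in sorted(hits))
        if len(architecture) >= 2:
            members[architecture].append(protein)
    return {
        arch: sorted(proteins)
        for arch, proteins in members.items()
        if len(proteins) >= min_proteins
    }


# Toy domain hits, invented for illustration only.
hits = {
    "P1": [(120, "B"), (5, "A")],
    "P2": [(1, "A"), (200, "B")],
    "P3": [(1, "A")],
    "P4": [(1, "B"), (90, "A")],
}
print(recurring_architectures(hits))
```

Domain order is treated as significant, so an A-then-B arrangement and a B-then-A arrangement count as different architectures.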

Support Year: 17
Fiscal Year: 2009
Total Cost: $6,542,075
Name: National Library of Medicine
Marchler-Bauer, Aron; Derbyshire, Myra K; Gonzales, Noreen R et al. (2015) CDD: NCBI's conserved domain database. Nucleic Acids Res 43:D222-6
Kim, Sunghwan; Han, Lianyi; Yu, Bo et al. (2015) PubChem structure-activity relationship (SAR) clusters. J Cheminform 7:33
Kim, Sunghwan; Thiessen, Paul A; Bolton, Evan E et al. (2015) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res 43:W605-11
Hähnke, Volker D; Bolton, Evan E; Bryant, Stephen H (2015) PubChem atom environments. J Cheminform 7:41
Madej, Thomas; Lanczycki, Christopher J; Zhang, Dachuan et al. (2014) MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 42:D297-303
Wang, Yanli; Suzek, Tugba; Zhang, Jian et al. (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:D1075-82
Hao, Ming; Wang, Yanli; Bryant, Stephen H (2014) An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal Chim Acta 806:117-27
Pan, Yongmei; Cheng, Tiejun; Wang, Yanli et al. (2014) Pathway analysis for drug repositioning based on public database mining. J Chem Inf Model 54:407-18
Cheng, Tiejun; Pan, Yongmei; Hao, Ming et al. (2014) PubChem applications in drug discovery: a bibliometric analysis. Drug Discov Today 19:1751-1756
Marchler-Bauer, Aron; Zheng, Chanjuan; Chitsaz, Farideh et al. (2013) CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 41:D348-52
