Protein three-dimensional structures are drawn from the Protein Data Bank (PDB), an international database collaboration supported in part by the NIH. PDB records are processed at NCBI to produce Molecular Modeling Database (MMDB) records with precise definitions of the component biological macromolecules and chemicals, and of their interactions as indicated by atomic contacts in the three-dimensional structure. Protein structure records are compared to NCBI protein sequence records using the Basic Local Alignment Search Tool algorithm (BLAST) and to one another using the Vector Alignment Search Tool structure-comparison algorithm (VAST). Protein sequences in the NCBI collection are also compared to protein family records in the Conserved Domain Database (CDD) using the Reverse Position-Specific BLAST algorithm (RPSB). These automated comparisons provide the cross-references needed to link protein and gene sequences in NCBI's extensive collection to the biological function annotation provided by protein structures.
Informatics projects were needed this year to keep pace with ongoing remediation by PDB, which has more than once modified every one of the more than 85,000 files in its collection. Fortunately, remediation of the bonded-atom connectivity (chemical graphs) of component molecules has become less frequent. Additional, potentially useful information is now provided, such as SPLIT records indicating when a given crystal structure has been divided into two or more PDB files, and Biological Assembly information indicating a subset of biologically important interactions. Unfortunately, this additional information is encoded in REMARK text without documented formats, and considerable effort has been required to develop reliable text-parsing algorithms. Nevertheless, correction of PDB SPLIT files and display of Biological Assembly information on MMDB Document Summary and Structure Summary pages progressed, and both were publicly released last year.
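As a rough illustration of how an interaction between two molecules can be inferred from atomic contacts, the sketch below counts atom pairs that lie within a distance cutoff. It is not the MMDB procedure: the function names, the 4.0 angstrom cutoff, the `min_pairs` threshold, and the toy coordinates are all assumptions chosen for the example.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def count_contacts(atoms_a, atoms_b, cutoff=4.0):
    """Count atom pairs (one atom from each molecule) within `cutoff` angstroms.

    The cutoff value is an illustrative assumption, not a documented MMDB setting.
    """
    return sum(1 for a in atoms_a for b in atoms_b if dist(a, b) <= cutoff)


def in_contact(atoms_a, atoms_b, cutoff=4.0, min_pairs=1):
    """Call two molecules interacting if enough atom pairs are in contact."""
    return count_contacts(atoms_a, atoms_b, cutoff) >= min_pairs


# Toy (x, y, z) coordinates in angstroms; hypothetical data, not from any PDB entry.
protein = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand = [(0.0, 3.0, 0.0), (20.0, 0.0, 0.0)]

n_pairs = count_contacts(protein, ligand)
interacting = in_contact(protein, ligand, min_pairs=2)
```

Raising `min_pairs` or lowering `cutoff` makes the interaction call stricter, which is the kind of threshold choice that has to be tuned against known complexes.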
Beginning this year, SPLIT PDB files have finally disappeared, replaced by new-format mmCIF files, which now also include large assemblies such as viral capsids that previously required 10 or more SPLIT PDB files. The mmCIF format required new parsing algorithms, which are now in place and have been validated by quantitative comparison of MMDB records derived from matched PDB-format and mmCIF-format files. Research to identify molecular interactions not included in the PDB Biological Assembly has continued, in part because large assemblies in mmCIF files contain no biological unit definitions. This research includes identification of contact thresholds for biologically relevant protein-ligand complexes, such as heme in hemoglobin, and of interactions observed among related protein structures but not mentioned in the PDB files of those structures. Research is in progress to modify the VAST algorithm to accurately align and superimpose the oligomeric 3D structures of Biological Assemblies, and the public Related Structures displays now distinguish Related Assemblies from Related Proteins.
The NCBI Conserved Domain Database (CDD) is derived in part from comprehensive protein sequence alignment collections prepared automatically by others. These include, for example, the Pfam collection prepared at the Wellcome Trust Sanger Institute and the Protein Clusters collection prepared at NCBI/IEB. More important contributions to CDD are the expert-curated protein family alignments prepared by staff of the CDD project. Highly accurate protein family alignments, consistent with known three-dimensional structures and structural superpositions, are prepared using algorithms within the See in Three Dimensions program (Cn3D). Conserved subfamilies consistent with evolutionary evidence are derived using phylogenetic tree algorithms and graphics within the Conserved Domain Tree program (CDTree).
Curators save into CDD records the phylogenetic trees that identify ancient conserved subfamilies, together with the biological function annotation derived from interactions observed in three-dimensional structures within each subfamily and/or from other observations, such as subfamily-specific experimental studies reported in the literature.
CDD informatics projects have continued this year. An algorithm that automates refinement of CDD subfamily alignments by rapidly performing multiple realignments of member sequences has continued to prove useful in reducing curator effort. A batch CD-search service supporting easy retrieval of CDD alignments and functional annotations for large groups of sequences has now been released and is in wide use. A research project on automated identification of ancient conserved multi-domain architectures has also continued and appears successful. The goal is to support efficient construction of ancient conserved multi-domain CDD records based on previously curated alignments of component domains. This will allow curators to provide multi-domain-specific functional annotation without needing to edit already-accurate alignments. Another research project continued this year aims to automate identification of ancient conserved subfamilies where accurate biological function annotation is likely, given known three-dimensional structures and/or literature citations for subfamily member sequences. If successful, this procedure will reduce the curator time required to browse the often-large phylogenetic trees produced by CDTree, supporting efficient identification of subfamilies where functional annotation is most feasible and worthwhile.
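To make the idea of a multi-domain architecture concrete, the sketch below reduces each sequence to the ordered tuple of domains found on it and then counts which multi-domain tuples recur across sequences. It is only an illustration, not the project's algorithm: the function names, the thresholds, and the domain and sequence identifiers are hypothetical, and simple recurrence is used here as a stand-in for the evolutionary evidence that "ancient conserved" would actually require.

```python
from collections import Counter


def architecture(hits):
    """Return the domain IDs of one sequence ordered by start position.

    `hits` is a list of (domain_id, start, end) tuples; this layout is an
    assumption for the example, not a CD-search output format.
    """
    return tuple(domain_id for domain_id, start, end in sorted(hits, key=lambda h: h[1]))


def recurring_architectures(hits_by_seq, min_domains=2, min_count=2):
    """Count multi-domain architectures seen in at least `min_count` sequences."""
    counts = Counter(architecture(hits) for hits in hits_by_seq.values())
    return {
        arch: n
        for arch, n in counts.items()
        if len(arch) >= min_domains and n >= min_count
    }


# Hypothetical domain hits; IDs and coordinates are invented for illustration.
hits_by_seq = {
    "seq1": [("domB", 120, 250), ("domA", 5, 100)],
    "seq2": [("domA", 10, 110), ("domB", 130, 260)],
    "seq3": [("domA", 1, 95)],
    "seq4": [("domB", 3, 140), ("domA", 150, 240)],
}

candidates = recurring_architectures(hits_by_seq)
```

Because the component domains are identified by ID only, a candidate architecture found this way can reuse existing domain alignments unchanged, which matches the stated goal of not re-editing already-accurate alignments.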

Support Year: 24
Fiscal Year: 2016
Name: National Library of Medicine
Marchler-Bauer, Aron; Derbyshire, Myra K; Gonzales, Noreen R et al. (2015) CDD: NCBI's conserved domain database. Nucleic Acids Res 43:D222-6
Kim, Sunghwan; Han, Lianyi; Yu, Bo et al. (2015) PubChem structure-activity relationship (SAR) clusters. J Cheminform 7:33
Kim, Sunghwan; Thiessen, Paul A; Bolton, Evan E et al. (2015) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res 43:W605-11
Hähnke, Volker D; Bolton, Evan E; Bryant, Stephen H (2015) PubChem atom environments. J Cheminform 7:41
Madej, Thomas; Lanczycki, Christopher J; Zhang, Dachuan et al. (2014) MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 42:D297-303
Wang, Yanli; Suzek, Tugba; Zhang, Jian et al. (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:D1075-82
Hao, Ming; Wang, Yanli; Bryant, Stephen H (2014) An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal Chim Acta 806:117-27
Pan, Yongmei; Cheng, Tiejun; Wang, Yanli et al. (2014) Pathway analysis for drug repositioning based on public database mining. J Chem Inf Model 54:407-18
Cheng, Tiejun; Pan, Yongmei; Hao, Ming et al. (2014) PubChem applications in drug discovery: a bibliometric analysis. Drug Discov Today 19:1751-1756
Marchler-Bauer, Aron; Zheng, Chanjuan; Chitsaz, Farideh et al. (2013) CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res 41:D348-52

Showing the most recent 10 out of 36 publications