PubChem provides a public repository of chemical-structure records contributed by more than 100 organizations. Processing is automated, allowing PubChem's Substance database to grow to over 61 million records in 5 years. A critical aspect of chemical-structure processing is standardization of valence-bond models, to provide the unique tautomer and/or resonance form stored in PubChem's Compound database. Standardization enables cross-linking of deposited Substance records that represent identical chemical structures and calculation of accurate comparison scores to detect compounds with similar though not identical chemical structures. PubChem's chemical structure databases can be searched by chemical name or structure and can display results as NCBI's Entrez document-summary lists, structure similarity diagrams, or detailed Compound or Substance summaries that include relevant biological activity information. An informatics project completed this year calculates theoretical three-dimensional structures for compounds in PubChem. Conformer similarity scores are used as an alternative means to select structurally similar compounds, and in analysis tools that display active-compound and bioactivity similarities among PubChem Bioassays. While accurate enumeration of conformers and selection of stable representatives is possible for chemical structures below a certain complexity, including most in PubChem, theoretical selection of the conformers responsible for a given biological activity is usually not possible. An ongoing research and informatics project aims to improve PubChem structure-activity analysis tools by selecting, from theoretical ensembles, those conformers whose conformer-similarity scores are most highly correlated with the experimental bioactivity scores in each PubChem Bioassay. The goal is to provide a chemoinformatics tool that can identify pharmacophores, the three-dimensional chemical substructures most associated with bioactivity.
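The conformer-selection idea described above can be illustrated with a minimal sketch. It assumes that, for each candidate conformer of a reference compound, similarity scores against the compounds tested in one Bioassay have already been computed, and it uses the Pearson correlation as the measure of agreement with the bioactivity scores. The abstract does not name a correlation measure, so that choice, the function names `pearson` and `best_conformer`, and all numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant list carries no correlation signal
    return cov / sqrt(var_x * var_y)


def best_conformer(similarity_by_conformer, activity_scores):
    """Return the conformer ID whose similarity scores track activity best."""
    return max(
        similarity_by_conformer,
        key=lambda cid: pearson(similarity_by_conformer[cid], activity_scores),
    )


# Hypothetical data: four tested compounds, two candidate conformers.
activity = [90, 70, 40, 10]
similarities = {
    "conf_a": [0.9, 0.8, 0.5, 0.2],  # similarity falls as activity falls
    "conf_b": [0.3, 0.9, 0.2, 0.8],  # no consistent trend
}
selected = best_conformer(similarities, activity)
```

In this toy data set `conf_a` is selected, because its similarity scores rise and fall with the activity scores while those of `conf_b` do not.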
PubChem's Bioassay database is a public repository for the results of chemical biology screening experiments, largely provided by grantees of the NIH Molecular Libraries Program (MLP). The number of Bioassays has grown this year to over 1,700 records, containing the results of over 58 million tests of the biological activity of specific chemical reagents. PubChem Bioassays contain a description of experimental protocols and are carefully curated to assure clarity of the experimental readouts provided in the data table associated with each record. Explicit links between Bioassays are created automatically whenever two Bioassays report test results for one or more of the same reagents, report one or more of the same reagents as biologically active, and/or link to target proteins or genes that are sequence-similar to one another. Usage of PubChem has also grown this year, to a daily average of over 50,000 users, comparable to other NCBI information resources of interest to scientists in particular disciplines. Informatics projects undertaken this year reflect the growing diversity of bioactivity results reported in PubChem Bioassays. One new experiment type is the panel Bioassay, which reports the activity of tested reagents against many specified targets, as sometimes used by MLP grantees to demonstrate selectivity of reagents for targets of interest. To accommodate panel Bioassays, new target labels for Bioassay data table readouts were required, as were new readout selections for panel Bioassays related to other Bioassays by target-sequence or reagent-bioactivity similarity. Another new experiment type is a screen of small interfering ribonucleic acid (siRNA) reagents, experiments that test the effect of siRNA gene-product "knock outs" on biological processes. New links to the genes targeted by each reagent were required, as were links to matching siRNA sequences if present in GenBank.
Work is in progress to display Bioassay similarity based on siRNA and/or target gene sequence similarity. Another new experiment type this year was the simplified summary Bioassay, requested by MLP steering committee members to provide an easy-to-update "bottom line" of a multiple-Bioassay screening experiment. To date over 110 summary Bioassays have been deposited. Two other new informatics projects were undertaken to improve the usability and discoverability of PubChem. One is the NCBI BioSystems database, describing metabolic, transcription-factor, or other systems biology pathways. Each record is defined by a description and lists of the molecules forming the pathway, be they protein and/or gene sequences, and/or the chemical structures of metabolites, reagents or drugs. BioSystems are deposited by others, and after a few months number over 100,000 records. Some are from the Kyoto Encyclopedia of Genes and Genomes, for example, whose chemical metabolite structures were first deposited into PubChem years ago. The presence of chemical structures in common BioSystems has been used to cross-link chemicals related in this way, and to similarly cross-link BioSystem-related genes and proteins. The other new informatics project, under active development, is a "selected records" annotation box soon to appear on all PubChem "document summary" displays. This box subsets the Substances, Compounds or Bioassays retrieved by an NCBI Entrez search, indicating records annotated by additional information. For example, in the "selected Compounds" box, the presence of certain chemicals in BioSystem records is indicated by a BioSystems label, with a brief list of the names of the most populated BioSystems and optional links to further details. The presence of Bioassay targets in BioSystems as genes or proteins is similarly annotated in the "selected Bioassays" box.
"Selected records" displays are meant to better present to users additional information available within NCBI information resources as a whole, and we plan to continue their further development.
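The automatic Bioassay linking rule stated in the abstract (two Bioassays are linked when they share tested reagents or share reagents reported as active) can be sketched with set intersections. This is an illustrative sketch only, not PubChem's implementation: the data layout, the function name `link_bioassays`, the reason labels, and the identifiers are hypothetical, and the third criterion (sequence similarity between targets) is omitted because it would require a sequence comparison step.

```python
from itertools import combinations


def link_bioassays(assays):
    """Map each linked pair of assay IDs to the reasons for the link.

    `assays` maps an assay ID to a dict with two sets of reagent IDs:
    "tested" (all reagents with results) and "active" (reported active).
    """
    links = {}
    for aid_a, aid_b in combinations(sorted(assays), 2):
        a, b = assays[aid_a], assays[aid_b]
        reasons = []
        if a["tested"] & b["tested"]:
            reasons.append("shared_tested")
        if a["active"] & b["active"]:
            reasons.append("shared_active")
        if reasons:
            links[(aid_a, aid_b)] = reasons
    return links


# Hypothetical assays and reagent identifiers.
example_assays = {
    1: {"tested": {10, 11, 12}, "active": {10}},
    2: {"tested": {12, 13}, "active": {13}},
    3: {"tested": {10, 20}, "active": {10}},
}
example_links = link_bioassays(example_assays)
```

Here Bioassays 1 and 2 are linked only by a shared tested reagent, Bioassays 1 and 3 are linked by both criteria, and Bioassays 2 and 3 are not linked.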

Support Year: 6
Fiscal Year: 2009
Total Cost: $4,883,521
Name: National Library of Medicine
Kim, Sunghwan; Thiessen, Paul A; Bolton, Evan E et al. (2016) PubChem Substance and Compound databases. Nucleic Acids Res 44:D1202-13
Marchler-Bauer, Aron; Derbyshire, Myra K; Gonzales, Noreen R et al. (2015) CDD: NCBI's conserved domain database. Nucleic Acids Res 43:D222-6
Kim, Sunghwan; Han, Lianyi; Yu, Bo et al. (2015) PubChem structure-activity relationship (SAR) clusters. J Cheminform 7:33
Kim, Sunghwan; Thiessen, Paul A; Bolton, Evan E et al. (2015) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res 43:W605-11
Hähnke, Volker D; Bolton, Evan E; Bryant, Stephen H (2015) PubChem atom environments. J Cheminform 7:41
Madej, Thomas; Lanczycki, Christopher J; Zhang, Dachuan et al. (2014) MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 42:D297-303
Wang, Yanli; Suzek, Tugba; Zhang, Jian et al. (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:D1075-82
Hao, Ming; Wang, Yanli; Bryant, Stephen H (2014) An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal Chim Acta 806:117-27
Pan, Yongmei; Cheng, Tiejun; Wang, Yanli et al. (2014) Pathway analysis for drug repositioning based on public database mining. J Chem Inf Model 54:407-18
Cheng, Tiejun; Pan, Yongmei; Hao, Ming et al. (2014) PubChem applications in drug discovery: a bibliometric analysis. Drug Discov Today 19:1751-1756
