PubChem provides a public repository of chemical-structure records contributed by more than 200 organizations. Processing is automated, allowing PubChem's Substance database to grow to over 120 million records in 9 years. A critical aspect of chemical-structure processing is standardization of valence-bond models, to provide the unique tautomer and/or resonance form stored in PubChem's Compound database. Standardization enables cross-linking of deposited Substance records that represent identical chemical structures and calculation of accurate comparison scores to detect compounds with similar though not identical chemical structures. PubChem's chemical structure databases can be searched by chemical name or structure and can display results as NCBI's Entrez document-summary lists, structure similarity diagrams, or detailed Compound and Substance summaries that include relevant biological activity information. An informatics project completed last year calculates multiple theoretical three-dimensional structures for compounds in PubChem. Conformer similarity scores are used as an alternative means to select structurally similar compounds, and in analysis tools that display active-compound and bioactivity similarities among PubChem Bioassays. While accurate enumeration of conformers and selection of stable representatives are possible for chemical structures below a certain complexity, including most in PubChem, theoretical selection of the conformers responsible for a given biological activity is usually not possible. An ongoing research and informatics project aims to improve PubChem structure-activity analysis tools by selecting from theoretical ensembles those conformers whose conformer-similarity scores are most highly correlated with the experimental bioactivity scores in each PubChem Bioassay. The goal is to provide a chemoinformatics tool that can identify pharmacophores, or three-dimensional chemical substructures most associated with bioactivity.
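The conformer-selection idea described above can be illustrated with a minimal sketch. It assumes that, for each candidate conformer of a query compound, a list of similarity scores against the tested compounds is already available, together with the experimental bioactivity scores of those same compounds. The function names, the use of Pearson correlation, and the example numbers are illustrative assumptions, not PubChem's actual method or data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant list carries no correlation signal
    return cov / sqrt(var_x * var_y)


def best_conformer(similarity_by_conformer, activity):
    """Return the conformer id whose similarity scores track activity best.

    similarity_by_conformer: hypothetical mapping of conformer id to a list
    of similarity scores, one per tested compound.
    activity: bioactivity scores for the same compounds, in the same order.
    """
    return max(
        similarity_by_conformer,
        key=lambda conf: pearson(similarity_by_conformer[conf], activity),
    )


# Made-up example: four tested compounds, two candidate conformers.
activity = [0.9, 0.7, 0.2, 0.1]
similarity_by_conformer = {
    "conf_a": [0.8, 0.6, 0.3, 0.2],  # rises and falls with activity
    "conf_b": [0.2, 0.9, 0.1, 0.8],  # largely unrelated to activity
}
print(best_conformer(similarity_by_conformer, activity))
```

In this sketch the conformer whose similarity profile most closely follows the measured activities is the one retained for structure-activity analysis.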
PubChem's Bioassay database is a public repository for the results of chemical biology screening experiments, many provided by grantees of the NIH Molecular Libraries Program (MLP). The number of Bioassays has grown to over 700 thousand records, containing the results of over 200 million tests of the biological activity of specific chemical reagents. PubChem Bioassays deposited by experimentalists contain a description of experimental protocols and are carefully curated to assure clarity of the experimental readouts provided in the data table associated with each record. Explicit links between Bioassays are created automatically, whenever two Bioassays report test results for one or more of the same reagents, report one or more of the same reagents as biologically active, link to target proteins or genes sequence-similar to one another, and/or link to the same cited publication. Usage of PubChem has also grown this year, to a daily average of over 90,000 users, comparable to other NCBI information resources of interest to scientists in particular disciplines. Informatics projects undertaken this year reflect the growing diversity of bioactivity results reported in PubChem Bioassays. Panel Bioassays that report the activity of tested reagents against many specified targets continue to be used by MLP grantees to demonstrate selectivity of reagents for targets of interest. To accommodate panel Bioassays, new target labels for Bioassay data table readouts were required, as were new readout selections for panel Bioassays related to other Bioassays by target-sequence or reagent-bioactivity similarity. Screens of the biological effects of Small Interfering RiboNucleic Acid (siRNA) reagents that "knock out" expression of individual gene products have also increased. New links to the genes targeted by each reagent were required, as were links to matching siRNA sequences if present in GenBank. 
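The automatic linking rule described above (shared tested reagents, shared active reagents, or a shared cited publication) can be sketched as follows. The record layout, field names, and assay identifiers are hypothetical, and the target sequence-similarity criterion is left out because it would require a sequence-comparison step that is not shown here.

```python
from itertools import combinations


def link_bioassays(assays):
    """Return {(aid_a, aid_b): [reasons]} for every linked pair of assays.

    assays: hypothetical mapping of assay id to a dict with three sets:
      "tested" - ids of reagents tested in the assay
      "active" - ids of reagents reported as biologically active
      "pmids"  - identifiers of cited publications
    """
    links = {}
    for a, b in combinations(sorted(assays), 2):
        reasons = []
        if assays[a]["tested"] & assays[b]["tested"]:
            reasons.append("shared_tested")
        if assays[a]["active"] & assays[b]["active"]:
            reasons.append("shared_active")
        if assays[a]["pmids"] & assays[b]["pmids"]:
            reasons.append("shared_publication")
        if reasons:
            links[(a, b)] = reasons
    return links


# Made-up records; the ids do not refer to real PubChem entries.
assays = {
    101: {"tested": {1, 2, 3}, "active": {1}, "pmids": {"PMID_A"}},
    102: {"tested": {3, 4}, "active": {4}, "pmids": set()},
    103: {"tested": {5}, "active": {5}, "pmids": {"PMID_A"}},
}
links = link_bioassays(assays)
print(links)
```

Comparing every pair is quadratic in the number of assays, so a collection of the size described here would more plausibly rely on an index from each reagent or publication to the assays that mention it; the pairwise loop is used only to keep the sketch short.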
Work is in progress to display Bioassay similarity based on siRNA and/or target gene sequence similarity. Another experiment type that expanded this year is the simplified summary Bioassay, used by MLP grantees to provide an easy-to-update "bottom line" of a multiple-Bioassay screening experiment. To date over 250 summary Bioassays have been deposited. A new source of bioassay results this year has been the contribution of literature-extracted bioactivity experiments by the European Bioinformatics Institute (EBI) ChEMBL project. A total of over 600,000 bioassay records, over 99% of the PubChem total, are now derived from this source. ChEMBL bioassays report only the test results shown in the literature, however, not HTS results as reported by MLP and other experimentalists. So in terms of reported reagent bioactivities, they represent only about 5% of the PubChem total. Other new informatics projects were undertaken to improve the usability and discoverability of PubChem. An important addition last year is the "selected records" box shown at the top left of all Entrez document-summary lists for PubChem Compound, Substance and Bioassay records. Now labeled "Refine your results", the box summarizes available annotation, such as bioactivity experiments for certain compounds, or protein-target definitions for certain bioassays. The "selected records" box also includes annotation from the new NCBI BioSystems database describing metabolic, transcription-factor, or other systems biology pathways. Each BioSystem record is defined by a description and lists of the molecules forming the pathway, be they protein and/or gene sequences, and/or the chemical structures of metabolites, reagents, or drugs. This annotation in the "selected records" box provides important information on the biological processes affected by a compound, or studied via the gene/protein target of a Bioassay.
A similar "selected records" box is shown at the top left of Entrez document-summary lists for the BioSystems database. It annotates the genes, proteins, and chemical structures that make up the more than 135,000 BioSystems now included in the collection.

National Library of Medicine
Kim, Sunghwan; Thiessen, Paul A; Bolton, Evan E et al. (2016) PubChem Substance and Compound databases. Nucleic Acids Res 44:D1202-13
Kim, Sunghwan; Thiessen, Paul A; Bolton, Evan E et al. (2015) PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res 43:W605-11
Kim, Sunghwan; Han, Lianyi; Yu, Bo et al. (2015) PubChem structure-activity relationship (SAR) clusters. J Cheminform 7:33
Marchler-Bauer, Aron; Derbyshire, Myra K; Gonzales, Noreen R et al. (2015) CDD: NCBI's conserved domain database. Nucleic Acids Res 43:D222-6
Hähnke, Volker D; Bolton, Evan E; Bryant, Stephen H (2015) PubChem atom environments. J Cheminform 7:41
Wang, Yanli; Suzek, Tugba; Zhang, Jian et al. (2014) PubChem BioAssay: 2014 update. Nucleic Acids Res 42:D1075-82
Hao, Ming; Wang, Yanli; Bryant, Stephen H (2014) An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Anal Chim Acta 806:117-27
Madej, Thomas; Lanczycki, Christopher J; Zhang, Dachuan et al. (2014) MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 42:D297-303
Pan, Yongmei; Cheng, Tiejun; Wang, Yanli et al. (2014) Pathway analysis for drug repositioning based on public database mining. J Chem Inf Model 54:407-18
Cheng, Tiejun; Pan, Yongmei; Hao, Ming et al. (2014) PubChem applications in drug discovery: a bibliometric analysis. Drug Discov Today 19:1751-6

Showing the most recent 10 out of 48 publications