PubChem contains the chemical structures of small organic molecules and information on their biological activities. It is intended to support the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. PubChem's chemical structure database may be searched by descriptive terms, chemical properties, and structural similarity. When possible, PubChem's chemical structure records are linked to other NCBI databases, for example the PubMed scientific literature database and NCBI's protein 3D structure database. PubChem also contains the results of high-throughput biological screening experiments. PubChem is organized as three linked databases within the Entrez/PubMed information retrieval system: PubChem Substance, PubChem Compound, and PubChem BioAssay. More information about using each component database may be found by following the link:

PubChem is a new project this year. Work has focused on the many cheminformatics subprojects needed to make the system described above operational in a short time frame. These include the design of a robust data exchange specification for chemical structure data and generic bioassay result data, the design of archival databases for chemical structure and bioassay data, and the design of an indexing procedure for integrating PubChem's three component databases into the Entrez search engine. The work also includes development of validation and standardization procedures for processing "legacy" chemical structure and bioassay data. These procedures employ a mix of novel and commercially available software to produce uniform valence-bond models of the chemical structures in the archive. Another subproject is the development of graphical display servers for chemical structures, both as individual "substance summary" displays and as graphical chemical-drawing components of Entrez record-summary displays.
Another is the development of procedures for calculating standard chemical properties, for example Lipinski-rule properties, and standardized descriptors such as SMILES, INChI, and IUPAC systematic chemical names. Development work was also needed to construct the similarity detection process that supports chemical identity and similarity neighbors within the Entrez system. This work again employed a mix of commercial and custom software, in particular to group compounds reasonably given the possibility of incomplete data on stereochemistry and isotopic labeling. Yet another subproject involved development of a procedure to link chemical and trivial names provided in the input data to MeSH headings and substance names, and in turn to articles in PubMed. These links have proven extremely valuable to biologist users searching for information on the biological activities of chemical compounds, or alternatively for information on chemical compounds associated with the diagnosis or treatment of disease or with other biological processes. A final subproject involved development of a novel bioassay result browser. It allows users to examine descriptions of the various depositor-supplied parameters and readouts, and to construct lists of "active" compounds according to thresholds they specify. Compounds selected in this way, according to biological activity, may in turn be used in further Entrez queries, for example to select those not containing substructures known to be associated with toxicity.

Much development remains to be done, but PubChem already provides the basis for cheminformatic analysis by medicinal chemist users. The initial public PubChem release in September 2004 included approximately 850,000 chemical structure records from 10 "legacy" government and academic sources, representing approximately 650,000 unique chemical compounds.
The initial release also included a set of approximately 200 bioassays from the DTP/NCI collection, each providing activity data on approximately 50,000 compounds on average.
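
The threshold-based selection of "active" compounds described for the bioassay result browser, followed by exclusion of compounds flagged in a separate query, can be illustrated with a minimal sketch. This is not PubChem's implementation: the `Readout` record layout, the function names, the assumption that a higher readout value means more activity, and all identifiers and values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Readout:
    """One depositor-supplied result for one compound (hypothetical layout)."""
    compound_id: int
    value: float  # assumed here: a higher value means more active


def select_actives(readouts, threshold):
    """Return the sorted IDs of compounds whose readout meets the threshold."""
    return sorted({r.compound_id for r in readouts if r.value >= threshold})


def exclude_flagged(compound_ids, flagged_ids):
    """Drop compounds flagged by a separate query (e.g. toxicity-linked substructures)."""
    flagged = set(flagged_ids)
    return [cid for cid in compound_ids if cid not in flagged]


# Made-up example data; these are not real PubChem identifiers or results.
readouts = [
    Readout(101, 6.2),
    Readout(102, 4.1),
    Readout(103, 7.5),
    Readout(104, 5.0),
]

actives = select_actives(readouts, threshold=5.0)
filtered = exclude_flagged(actives, flagged_ids=[103])
print(actives)   # [101, 103, 104]
print(filtered)  # [101, 104]
```

In this sketch the user-chosen threshold is the only parameter of the selection step, which mirrors the idea that the list of actives depends on thresholds the user specifies rather than on a fixed definition of activity.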

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Intramural Research (Z01)
National Library of Medicine
United States