PubChem's databases contain over 10 million unique chemical structures linked to over 600 types of biological tests, with a total of over 12 million substance test results. The PubChem information resource is accessed by over 25,000 users per day, many requesting bioactivity summary views of specific compound sets. In the words of a 2007 editorial in Nature Chemical Biology, commenting on NIH's Molecular Libraries Program, PubChem's "annotations can open up new insights into the interplay between chemical structure and biological activity."

Design of a public chemical-structure archive must address two fundamental and, to some extent, contradictory requirements. On one hand, the archive must faithfully represent the chemical structures, chemical names, web site links, and other information exactly as provided by contributors. This is essential for attribution and for users who want precisely the "view" provided by a particular depositor. On the other hand, users who want to look up the biological activities of a compound are normally interested in an aggregate view, where results from different contributors are combined and summarized. PubChem addresses these requirements via a two-layer archive design. As-deposited chemical structure information is retained in a "Substance" layer, with depositor updates handled through a conventional versioning scheme. A computationally derived "Compound" layer then maps alternate but equivalent valence-bond models and tautomers to a single "canonicalized" reference structure.

Indexing of chemical structure similarity, or "neighboring," is as useful in PubChem as in biopolymer sequence databases, and for the same reason: users may be able to infer possible biological activities for as-yet-untested compounds from those of similar compounds.
Chemical structure similarity is conventionally detected by similarity of substructure-composition vectors, or "fingerprints," which are constructed in PubChem using a well-known and documented method. Structure similarities in PubChem generally reflect engineering of analog series by chemists, rather than evolution and natural selection. For this reason PubChem also offers a graphical "Structure Clustering" tool that assists users in distinguishing analog series and in examining associations with bioactivity. PubChem also supports an on-the-fly structure search service. This allows a user to input one or more structures and returns a list of the same or similar structures in PubChem, or structures in PubChem for which the input is a substructure.

Design of a public archive for biomolecular screening results must also address two somewhat contradictory requirements. On one hand, the archive must fully preserve experimental results so that scientists can examine the strength of evidence for any reported biological activity. On the other hand, much of the discovery value of the archive derives from a user's ability to compare activities across assays, to determine, for example, that some compounds active in an enzyme assay that reports IC50 values are also active in a cell-based assay that reports the fractions of cells staining red and/or green. PubChem addresses these requirements with a "flexible" bioassay data model. The protocol description and definition of experimental readouts are as the depositing scientist specifies, to faithfully reflect his or her experimental results. To enable comparison across assays, however, PubChem enforces a semantic rule as to how results may be reported: only one summary result per tested substance.
With this rule in place, depositors are asked to report two additional summary results for each tested substance: "Outcome," as in "active" or "inactive" in the depositor's judgment, and "Score," a number indicating relative potency.

Discovery of relationships among the biological targets and processes probed by PubChem bioassays is one of the research potentials of the PubChem archive. To assist, PubChem's retrieval and analysis tools include bioassay "neighboring," or pre-computed links to similar bioassays. As a feasible approach, my colleagues and I have chosen three similarity metrics related to biological processes: neighbors as detected by depositor assertion, neighbors as detected by sequence similarity of protein targets, and neighbors as detected by activity profile similarity. Bioassay target-similarity neighbors are detected by straightforward calculation of sequence similarities of depositor-identified protein targets. Users may view target-similarity neighbors as a list of related bioassays within Entrez. However, a user's intent in asking about related targets is often to ask how selective the identified actives are for the target of a bioassay, as compared to related targets of other bioassays. To assist, PubChem supports drill-down to "Structure-Activity Analysis" displays, which cluster bioassays and compounds in a "heatmap" that indicates the presence of active or inactive compounds in each cluster. For selectivity analysis involving related-target bioassays, the tool provides direct visualization of whether the bioassays identify compounds selective for a particular target, partially selective among the most related targets, and/or broadly active across many related targets.

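The three selectivity categories named above can be illustrated with a small sketch. This is not PubChem's implementation: the function name, the outcome labels as plain strings, and the `broad_fraction` cutoff are all assumptions chosen for illustration, and the tool itself presents this information visually rather than as labels.

```python
from typing import Dict


def classify_selectivity(outcomes: Dict[str, str], broad_fraction: float = 0.5) -> str:
    """Label one compound from its outcomes in a set of related-target bioassays.

    outcomes maps a bioassay identifier to "active" or "inactive".
    broad_fraction is a hypothetical cutoff, not a PubChem parameter.
    """
    tested = len(outcomes)
    if tested < 2:
        # Selectivity is only meaningful across at least two related assays.
        return "insufficient data"
    n_active = sum(1 for outcome in outcomes.values() if outcome == "active")
    if n_active == 0:
        return "inactive"
    if n_active == 1:
        return "selective"
    if n_active / tested <= broad_fraction:
        return "partially selective"
    return "broadly active"


# Hypothetical outcomes for one compound in four related-target bioassays.
example = {"assay_A": "active", "assay_B": "inactive",
           "assay_C": "inactive", "assay_D": "inactive"}
print(classify_selectivity(example))
```

A fixed cutoff is the simplest possible rule; a real analysis would likely also weigh how closely related the targets are, which this sketch ignores.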
Activity-profile neighboring is perhaps PubChem's most powerful tool for detection of bioassay relationships, inasmuch as it depends only on the results of bioactivity testing itself. In this context the activity profile of a bioassay is simply the list of active and inactive compounds identified. Activity overlap is defined simply as a percentage: of the compounds tested in common by two bioassays, what percentage is active in both? Activity-profile neighbors may be viewed as a list of related assays within Entrez. While lists sorted by overlap score are typically effective in separating truly related bioassays from "noise" due to the false-positive hits present in screening assays, the user can again delve further into how the assays are related by drilling down to PubChem's "Structure-Activity Analysis" display. In this context the tool clusters bioassays by activity profile similarity, the metric that has been used to detect related bioassays, and clusters compounds by chemical structure similarity. The display typically identifies truly related bioassays as those clustered both by activity profile similarity and chemical structure similarity, since many of the same compounds are active in each related bioassay, and structural similarity reflects the analog series typically explored in a screening campaign.

Many links from PubChem to other NCBI databases are derived from depositor-supplied information, such as the protein target of a bioassay or a publication reporting synthesis of a compound. Other links are derived computationally by the PubChem team. Links to protein 3-dimensional structures are derived, for example, by extracting ligand chemical structures from each protein structure record and adding those structures to PubChem.
Links to PubMed are based on matching depositor-supplied chemical names in PubChem to Medical Subject Headings (MeSH) from the drugs and chemicals subtree of the MeSH ontology, and in turn to PubMed articles indexed under those terms by the MeSH indexing team at NLM.

PubChem contains chemical structures and biological testing results contributed by over 70 academic, commercial, and governmental organizations. To facilitate data acquisition from these many depositors, the PubChem team has developed an automated deposition gateway, an interactive, web-based software system that allows depositors to add and update contributed records. This system provides numerous data sanity and consistency checks, interpretable data-error messages, and a wizard system to guide depositors through the upload process.
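
The activity-overlap score defined earlier for activity-profile neighboring (of the compounds tested in common by two bioassays, the percentage active in both) can be written out as a minimal sketch. The function name and the representation of an activity profile as a dictionary from compound identifier to a boolean are assumptions made for illustration, not PubChem's data structures.

```python
from typing import Dict, Optional


def activity_overlap(assay_a: Dict[int, bool], assay_b: Dict[int, bool]) -> Optional[float]:
    """Percentage of compounds tested in both assays that are active in both.

    Each profile maps a compound identifier to True (active) or False (inactive).
    Returns None when the two assays share no tested compounds.
    """
    common = assay_a.keys() & assay_b.keys()
    if not common:
        return None
    both_active = sum(1 for cid in common if assay_a[cid] and assay_b[cid])
    return 100.0 * both_active / len(common)


# Hypothetical profiles: compounds 2, 3, 4 and 5 are tested in both assays,
# and only compound 2 is active in both.
profile_a = {1: True, 2: True, 3: False, 4: True, 5: False}
profile_b = {2: True, 3: True, 4: False, 5: False, 6: True}
print(activity_overlap(profile_a, profile_b))  # 25.0
```

Returning `None` for assays with no compounds in common keeps "no evidence" distinct from "zero overlap", which matters when overlap scores are sorted to rank neighbors.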

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM100604-04
Application #: 7594477
Support Year: 4
Fiscal Year: 2007
Total Cost: $2,789,314
Institution: National Library of Medicine
Country: United States