One motivation of our tautomerism-related work is thus to use all tools at our disposal (chemoinformatics analyses, QM computations, experimental work, and systematic extraction of results from the literature) to provide a scientific footing for recommendations on how to improve the handling of tautomerism in InChI V2, instead of just holding a vote in the Working Group. While prototropic tautomerism rules are the only ones currently implemented as the standard rule set in CACTVS, and all tautomeric transformations covered by InChI (by default or by option) are prototropic, ring-chain (RC) tautomerism is well known and widespread. Nevertheless, and somewhat surprisingly, very little in terms of RC rules was available in chemoinformatics until recently. Based on Baldwin's well-known set of rules for predicting the relative facility of ring-forming reactions, we developed a set of 11 rules describing RC tautomerism. The rules were encoded in SMIRKS, the chemical transform extension of the SMILES chemical structure line notation developed by Daylight Chemical Information Systems, Inc., just as the currently 20 individual rules for prototropic tautomerism in CACTVS are encoded. A number of modifications were applied to Baldwin's rule set, which, after all, describes ring closure in general, not RC tautomerism specifically. Foremost, ring closure and opening reactions involving a tetrahedral electrophilic carbon were left out: they break a single bond and would thus cause the molecule to lose atoms, violating the definition of tautomerism. Adding these new RC rules to the existing standard prototropic rules in CACTVS, we applied the combined rule set to the poster child of RC tautomerism: warfarin. This anticoagulant drug, in wide use for decades, can theoretically exist in solution in 40 distinct tautomeric forms.
We investigated all these tautomers with computational approaches (relative energies calculated at the B3LYP/6-311G+ level of theory) and recorded NMR (13C and 1H) spectra. We introduced an intuitive graphical network representation of tautomers and their interconversion paths, which for warfarin contained 11 tautomers and 17 tautomeric transformations between them allowed by our rules. We then applied the combined RC and prototropic rule set to an entire database: the Aldrich Market Select (AMS) database of (then) 6 million screening samples and building blocks [96]. We found over 30,000 cases where two or more AMS products were declared by our rules to be just different tautomeric forms of the same compound. We purchased 166 such tautomer pairs (plus a few triplets) from the AMS and performed 1H and 13C NMR analyses to determine whether the chemoinformatics transforms had accurately predicted which products were in fact the same material in the bottle. Essentially all prototropic transforms for which examples existed in the AMS were confirmed (some of the rarer types of tautomerism had no such conflict pairs in the AMS). Some of the RC transforms were found to be too aggressive, i.e., they equated structures that were different compounds according to the NMR analyses. This paper received an Editor's Choice selection in the Journal of Chemical Information and Modeling. To provide additional experimental data for tautomerism-related analyses and chemoinformatics work, we have created a database of data extracted from the experimental literature. This database consists of 1,873 entries, each corresponding to an n-tuple of tautomers studied under a particular set of experimental conditions (pH, solvent, temperature, technique), adding up to 3,898 records since the average n is slightly above 2.
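As a brief illustration of the database-wide analysis described above: treating catalog products as nodes and rule-allowed tautomeric transformations as edges, the step of declaring several products to be forms of one compound can be pictured as finding connected components in a tautomer network. The following is a minimal sketch under stated assumptions, not the CACTVS implementation. The function name `group_tautomers`, the product IDs, and the edge list are all hypothetical, and the edges are simply given as input here rather than generated by applying SMIRKS transforms.

```python
from collections import defaultdict


def group_tautomers(structures, transforms):
    """Group structure IDs into sets linked by tautomeric transformations.

    structures: iterable of structure IDs (nodes of the network)
    transforms: iterable of (id_a, id_b) pairs, each meaning that some rule
                interconverts the two structures (undirected edges)
    """
    neighbors = defaultdict(set)
    for a, b in transforms:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    groups = []
    for start in structures:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(component)
    return groups


# Hypothetical catalog IDs and hypothetical rule hits (not real AMS data).
products = ["P1", "P2", "P3", "P4", "P5"]
edges = [("P1", "P2"), ("P2", "P3")]

groups = group_tautomers(products, edges)
# "Conflict" groups: two or more products predicted to be the same compound.
conflicts = [g for g in groups if len(g) >= 2]
```

In this picture, a rule that is too aggressive adds edges that merge groups which experiment shows to be distinct compounds, which is the kind of error the NMR comparison was designed to detect.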
The data were extracted from 73 publications, many of them reviews, taken from a selection of 200 papers provided to the contractor company that did the initial extraction (Parthys Reverse Informatics), out of about 900 papers we identified in literature searches as potentially containing useful data for this purpose. Each tautomer (or tuple, as appropriate) is annotated with structural information (SMILES, InChI, InChIKey, NCI/CADD Identifiers); prevalence data (measured ratios, interconversion rates, relative energies, etc.); condition data (solvent, temperature, pH, etc., if given); method data (NMR, UV spectroscopy, IR spectroscopy, etc.); and reference data (bibliographic information). To the best of our knowledge, such a tautomer database does not exist elsewhere, certainly not in the public domain.
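To make the entry/record distinction and the annotation categories above concrete, here is a minimal sketch of how one such entry could be represented. The class names, field names, and example values are assumptions made for illustration only; they do not reflect the actual schema of the database, and no measured values are included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TautomerRecord:
    """One tautomer within a tuple (hypothetical field names)."""
    smiles: str
    inchi: Optional[str] = None
    inchikey: Optional[str] = None
    prevalence: Optional[str] = None  # e.g. a measured ratio, as reported


@dataclass
class TautomerEntry:
    """One n-tuple of tautomers studied under one set of conditions."""
    tautomers: List[TautomerRecord]
    solvent: Optional[str] = None      # condition data, if given
    temperature: Optional[str] = None
    ph: Optional[str] = None
    method: Optional[str] = None       # e.g. "NMR", "UV", "IR"
    reference: Optional[str] = None    # bibliographic information


def total_records(entries):
    """Number of individual tautomer records across all entries."""
    return sum(len(e.tautomers) for e in entries)


def average_tuple_size(entries):
    """Average n over all n-tuples."""
    return total_records(entries) / len(entries)


# Illustrative placeholder entries: structures only, no experimental values.
entries = [
    TautomerEntry(
        tautomers=[TautomerRecord("Oc1ccccn1"), TautomerRecord("O=C1C=CC=CN1")],
        method="NMR",
        reference="placeholder reference",
    ),
    TautomerEntry(
        tautomers=[
            TautomerRecord("Oc1ccncn1"),
            TautomerRecord("O=C1NC=NC=C1"),
            TautomerRecord("O=C1N=CNC=C1"),
        ],
        method="UV",
        reference="placeholder reference",
    ),
]
```

With this layout, the figures quoted in the text correspond to `len(entries)` being 1,873 and `total_records(entries)` being 3,898, which gives an average tuple size of 3,898 / 1,873, or about 2.08.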
Guasch, Laura; Yapamudiyansel, Waruna; Peach, Megan L et al. (2016) Experimental and Chemoinformatics Study of Tautomerism in a Database of Commercially Available Screening Samples. J Chem Inf Model 56:2149-2161