Small molecules are one of the most important classes of therapeutics alleviating suffering and in many cases death for hundreds of millions of people worldwide. Small molecules also serve as invaluable tools to study biology, often with the goal to validate novel targets for the development of future therapeutic drugs. Reproducibility of experimental results and the interoperability and reusability of resulting datasets depend on accurate descriptions of associated research objects, and most critically on correct representations of small molecules that are tested in biological assays. For example, it is not possible to develop predictive models of protein target - small molecule interactions if their chemical structure representations are not correct. Many factors contribute to errors in reported chemical structures in small molecule screening and omics reference databases, scientific publications, and many other web-based resources and documents. Because of the complexity of representing small molecules chemical structure graphs and the lack of thorough curation, errors are frequently introduced by non-experts and error propagation across different digital research assets is a pervasive problem. To address this challenging problem via a scalable approach, we propose the Automated Molecular Identity Disambiguator (AutoMID). AutoMID will be usable in batch mode at scale via an API, for example to assist chemical structure standardization and registration by maintainers of digital research assets, and also via interactive (UI) mode for everyday researchers to quickly and easily validate or correct their small molecule representations. AutoMID will leverage extensive highly standardized linked databases of chemical structures and associated information including names, synonyms, biological activity and physical properties and their sources / provenance and leverage expert rules and AI to enable reliable disambiguation of chemical structure identities at scale.
Small molecules are one of the most important types of drugs. They also serve as invaluable tools to study biology. The complexity of representing chemical graphs and the lack of thorough curation leads to frequent small molecule structure errors, which propagate across digital research assets, impeding their interoperability and reusability. To address this challenging problem, we propose the Automated Molecular Identity Disambiguator (AutoMID). Built on expert knowledge and AI, AutoMID will enable researchers and maintainers of data repositories to reliably identify and resolve ambiguities in chemical structures at scale.