Biological assays are the foundation for developing chemical probes and drugs, but new Big Data approaches ? which have revolutionized other areas of biomedical science ? have not yet advanced this early step of biomedical research: analysis of assay data. The obstacle is that scientists specify their assays through text descriptions written in scientific English, which need to be translated into standardized annotations readable by computers. This lack of standardized and machine-readable assay descriptions is a major impediment to manage, find, aggregate, compare, re-use, and learn from the ever-growing corpus of assays (e.g., >1.2 million in PubChem). Thus, there is a critical need for better annotation and curation tools for drug discovery assays. However, the process to go from a simple text protocol to highly detailed machine-readable semantic annotations is not trivial. Multiple tools and technologies are required: ontologies or the structured controlled vocabularies; templates that map specific vocabularies to properties that are to be captured; and software tools to actually apply these ontologies to a given text. Currently, each of these exists in isolation; yet, a bottleneck in any one tool or technology, or a gap between the different pieces, disrupts the overall process, resulting in poor or no annotation of the datasets. Here we propose a project to combine and integrate these three technologies (which are also the core competencies of the three groups collaborating on this proposal). We will deliver a novel, comprehensive, user-friendly data annotation and curation system that is highly interconnected, encompassing the full cycle, and real-world practice, of required tasks and decisions, by all parties within the `bioassay annotation ecosystem' (researchers performing curation, dedicated curators, IT specialists, ontology owners, and librarians/repositories). The alliance between academic and commercial collaborators, who already work together, will greatly benefit the project and minimize execution risk.
Our specific aims are to: (1) Develop a bioassay-specific template editor and templates by adopting the Stanford (Center for Expanded Data Annotation and Retrieval, CEDAR) data model to the machine learning-based curation tool BioAssay Express, to exploit the broad functionality of its data structures, tools and interfaces; (2) Define and create an ontology update process and tool (`OntoloBridge') to support rapid feedback between curators/users and ontology experts and enable semi-automated incorporation of suggestions for updates to existing published ontologies; (3) Develop new tools to export annotated data into public repositories such as PubChem; and (4) Evaluate our solution across diverse audiences (pharma, academia, repositories). The system will improve bioassay curation efficiency, quality, and effectiveness, enabling scientists to generate standardized annotations for their experiments to make these data FAIR (Findable, Accessible, Interoperable, Reusable). We envision this suite of tools will encourage annotation earlier in the data lifecycle while still supporting annotation at later stages (e.g., submission to repositories or to journals).
Biological assays are the foundation for developing drugs, but new Big Data approaches ? which have revolu- tionized other areas of biomedical science ? have not yet advanced this early step of biomedical research: analysis of assay data. The obstacle is that assays are written in scientific English, which need to be translated into standardized descriptions readable by computers. This lack of machine-readable annotations is a major impediment to manage, find, compare, re-use, and learn from the millions of assays. This project will develop a formal process and integrated tools to support the complete cycle of tasks and decisions required for bioassay annotation, enabling expedited (and more cost-effective) drug discovery.