The proposed research aims to support and improve effective access to the biomedical literature by using the rich, highly informative image data within publications in addition to text. The biomedical literature is expanding at a rate of about 1,000,000 new publications a year. Scientists and physicians, as part of their daily work, go through a myriad of publications searching for relevant information. The task is even more arduous for scientific database curators (biocurators, in organizations such as FlyBase or UniProt), who have to identify the literature most relevant to the database area, locate within it high-quality evidence concerning genes, proteins, organisms, or disease, and curate the findings within a database entry, with references to the relevant literature. Notably, much of the evidence within publications lies in figures. Accordingly, scientists and database curators use images as indicators of relevance. To assist and expedite the search for information within the literature, automated text-mining tools are being developed; still, several shared tasks and competitive challenges have demonstrated that automated identification of relevant information in biomedical publications is not yet effective enough and remains a bottleneck for biocuration and for scientific discovery. While image analysis within and outside the biomedical domain is an active research area, most current work on biomedical image processing focuses on retrieval and understanding of images as a primary form of data. Likewise, most efforts on biomedical literature retrieval and mining focus on text alone. Little has been done so far to use images within publications, which provide important cues as to the relevance of information embedded in papers.
The hypothesis underlying our proposal is that useful information can be derived directly from images within publications and integrated with text-based methods, leading to improved identification of relevant publications and of informative portions within them. The proposed research comprises an extensive comparative study of highly informative features within images, development and identification of such image features, development of tools that extract such features and information from images, and integration of image-based information into the textual article-classification process, aiming to determine the publications' relevance to well-defined biomedical needs. The fundamental research tasks we shall address are: (A) identification and comparative study of useful features for image representation, focusing on their utility for specific biomedical needs; (B) classification of biomedical images and biomedical documents based on image data; (C) document classification through integration of text- and image-based classifiers. To ground the research in genuine needs, secure access to ample image data, and ensure broad applicability of the results, we shall work within three diverse areas for which we have secured access to expertise and data: finding articles about cis-regulatory regions (Cyrene project at Brown University); evidence for gene expression in the mouse (Jackson Lab's GXD); and experimental evidence for protein-protein interaction (Delaware's Protein Information Resource). The successful completion of the proposed project will provide integrated methods and tools that use both image-based and text-based features, leading to more focused and effective retrieval and mining tools, and thus better supporting data-intensive biomedical discovery.
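As an illustration of task (C) only, the following minimal sketch shows one possible way to integrate text- and image-based classifier outputs, namely late fusion by weighted averaging. It is not the project's method: the `Document` type, the max pooling over figures, the 0.6 text weight, and the 0.5 threshold are all assumptions chosen for the example, and the scores stand in for the outputs of hypothetical upstream classifiers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    # Hypothetical inputs: probabilities produced by upstream classifiers.
    text_score: float          # text classifier's relevance probability
    image_scores: List[float]  # one relevance probability per figure


def pool_image_scores(image_scores: List[float], default: float = 0.5) -> float:
    """Collapse per-figure scores into one document-level image score.

    Max pooling is an assumption: a single highly relevant figure is
    treated as sufficient evidence. Documents without figures fall back
    to an uninformative default.
    """
    if not image_scores:
        return default
    return max(image_scores)


def fuse(doc: Document, text_weight: float = 0.6) -> float:
    """Late fusion: weighted average of the text and pooled image scores."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    image_score = pool_image_scores(doc.image_scores)
    return text_weight * doc.text_score + (1.0 - text_weight) * image_score


def is_relevant(doc: Document, threshold: float = 0.5,
                text_weight: float = 0.6) -> bool:
    """Label a document relevant when the fused score reaches the threshold."""
    return fuse(doc, text_weight) >= threshold
```

In this sketch, a document whose text alone scores below the threshold can still be flagged as relevant when one of its figures scores highly, which mirrors the premise that figures carry evidence the text-only classifier may miss. In practice the weight and the pooling rule would be learned or tuned on curated data rather than fixed by hand.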
Physicians and biomedical scientists rely on the vast published literature as their main source of information about current developments and findings, on which they base both patient treatment and ongoing research toward biomedical discovery. The proposed research will support physicians and scientists, speed up their search for information, and improve their effective access to the most relevant part of the biomedical literature by developing new methods and tools that take advantage not only of text but also of the highly informative image data within publications. The successful outcome of this research will lead to the development of focused, effective tools for finding information pertinent to biological phenomena and medical needs, thus expediting targeted biomedical discovery, including better understanding of disease mechanisms, uncovering possible means for accurate diagnoses, and revealing potential new drugs and drug targets.
Shatkay, Hagit; Brady, Scott; Wong, Andrew (2015) Text as data: using text-based features for proteins representation and for computational prediction of their characteristics. Methods 74:54-64