The proposed research aims to develop and advance tools that use image data appearing in scientific publications, in addition to text, to support beneficial, targeted access to the biomedical literature. The biomedical literature grows by over one million new publications per year. Identifying relevant information requires scientists and physicians to scan through a myriad of papers daily. For scientific database curators (bio-curators, in organizations such as Jackson Labs or UniProt), the task is particularly onerous: they must identify the articles most significant to the database, locate within them high-quality evidence concerning diseases, genes/proteins, and mutations, and curate the findings in database entries along with references to the relevant evidence in the articles. Notably, much of the evidence within publications lies in figures; images are thus rich and essential indicators of relevance. While biomedical text mining tools are being developed to expedite the search for information within publications, several competitive shared tasks have underscored the need for more effective tools to overcome the bottleneck in bio-curation and scientific discovery. Moreover, bio-curators point out the importance of images as a key information source. While image analysis is an active research field, most current work on biomedical image processing focuses on image identification, understanding, and indexing, not on images as aids to document analysis. Similarly, most work on biomedical literature mining focuses on text alone. Thus, little has been done so far to utilize, in addition to text, the images within publications, which provide important cues about the relevance of the information embedded in articles.
Our premise, supported by bio-curators' experience, is that information derived from images can (and should) be directly incorporated into biomedical document retrieval and classification, and that doing so will improve the accurate identification of relevant articles (for a given user's needs) while pinpointing significant evidence within them. We will comprehensively identify, develop, and compare informative image features, develop methods and tools for representing both images and documents based on such features, and introduce means to effectively integrate image-based data into the text-based document classification process. The work will comprise the following fundamental tasks: A) Building robust tools for harvesting images from PDF articles and segmenting compound figures into individual image panels; B) Identifying and investigating highly informative features for biomedical image representation, and categorizing biomedical images into significant types and classes; C) Effectively representing documents using both text and images, and integrating text-based and image-based classifiers. We anchor our research in genuine needs, secure access to a large volume of image data, and strive for broad applicability of the results by working within several broad and diverse curation areas at institutes with which we collaborate: evidence for gene expression and phenotypes in mouse (Jackson Labs) and in worm (WormBase), and experimental evidence for protein-protein interaction (Protein Information Resource). The work on this project will result in new methods and tools that take advantage of both image and text data, facilitating more effective and focused retrieval and mining, and thus better supporting bio-curation and data-intensive biomedical discovery.

Public Health Relevance

Published biomedical literature forms a vast information source for biomedical scientists and physicians; both treatment decisions and research toward biomedical discovery are based on such information. The proposed research aims to support and speed up the search for information while improving effective access to the most relevant parts of the biomedical literature, by developing new methods and tools that take advantage of the highly informative image data within publications. A successful outcome of this research will lead to the development of well-targeted, effective tools for finding information pertinent to biological phenomena and medical needs, thus expediting focused biomedical discovery, including a better understanding of the role of gene mutations in disease mechanisms, uncovering interactions among proteins, and revealing potential new drugs and drug targets.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 1R01LM012527-01A1
Application #: 9457095
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Vanbiervliet, Alan
Project Start: 2017-09-14
Project End: 2021-08-31
Budget Start: 2017-09-14
Budget End: 2018-08-31
Support Year: 1
Fiscal Year: 2017
Total Cost:
Indirect Cost:
Name: University of Delaware
Department: Biostatistics & Other Math Sci
Type: Biomed Engr/Col Engr/Engr Sta
DUNS #: 059007500
City: Newark
State: DE
Country: United States
Zip Code: 19716
Li, Pengyuan; Jiang, Xiangying; Kambhamettu, Chandra et al. (2018) Compound image segmentation of published biomedical figures. Bioinformatics 34:1192-1199
Elhalawani, Hesham; Lin, Timothy A; Volpe, Stefania et al. (2018) Machine Learning Applications in Head and Neck Radiation Oncology: Lessons From Open-Source Radiomics Challenges. Front Oncol 8:294