Incorporating Image-based Features into Biomedical Document Classification

Shatkay, Hagit; Marai, Georgeta-Elisabeta

Abstract

The proposed research aims to develop and advance tools that use image data appearing in scientific publications, in addition to text, to support beneficial, targeted access to the biomedical literature. The number of biomedical publications grows at a rate of over one million new publications per year. Identifying relevant information requires scientists and physicians to scan through a myriad of papers daily. For scientific database curators (bio-curators, in organizations such as Jackson Labs or UniProt), the task is particularly onerous: they must identify the articles most significant to the database, locate within them high-quality evidence concerning disease, genes/proteins and mutations, and curate the findings in database entries along with references to the relevant evidence in the articles. Notably, much of the evidence within publications lies in figures; images are thus rich and essential indicators of relevance. While biomedical text mining tools are being developed to expedite the search for information within publications, several competitive shared tasks have underscored the need for more effective tools to overcome the bottleneck in bio-curation and scientific discovery. Moreover, bio-curators point out the importance of images as a key information source. While image analysis is an active research field, most current work on biomedical image processing focuses on image identification, understanding and indexing, not on images as aids to document analysis. Similarly, most work on biomedical literature mining focuses on text alone. Thus, little has been done so far to utilize, in addition to text, the images within publications, which provide important cues about the relevance of the information embedded in articles.
Our premise, supported by bio-curators' experience, is that information derived from images can (and should) be directly incorporated into biomedical document retrieval and classification, and that doing so will improve the accurate identification of relevant articles (for a given user's needs) while pinpointing significant evidence within them. We will comprehensively identify, develop and compare informative image features, develop methods and tools for representing both images and documents based on such features, and introduce means to effectively integrate image-based data into the text-based document classification process. The work will comprise the following fundamental tasks: A) Building robust tools for harvesting images from PDF articles and segmenting compound figures into individual image panels; B) Identifying and investigating highly informative features for biomedical image representation, and categorizing biomedical images into significant types and classes; C) Effectively representing documents using both text and images, and integrating text-based and image-based classifiers. We anchor our research in genuine needs, secure access to a large amount of image data, and strive for broad applicability of the results by working within several broad and diverse curation areas at institutes with which we collaborate: evidence for gene expression and phenotypes in mouse (Jackson Labs) and in worm (WormBase), and experimental evidence for protein-protein interaction (Protein Information Resource). The work on this project will result in new methods and tools that take advantage of both image and text data, facilitating more effective and focused retrieval and mining, and thus better supporting bio-curation and data-intensive biomedical discovery.
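
As an illustration only, the integration of text-based and image-based classifiers described in task C could take the form of a simple late-fusion step. The sketch below is not the project's method; the function names, the neutral score of 0.5 for documents without figures, the averaging over panels, and the weight of 0.6 on the text score are all assumptions made for the example. It presumes that a text classifier and an image-panel classifier have each already produced relevance probabilities between 0 and 1.

```python
from statistics import mean


def document_image_score(panel_scores):
    """Aggregate per-panel relevance probabilities into one document score.

    Assumption: a document with no extracted panels gets a neutral 0.5.
    """
    return mean(panel_scores) if panel_scores else 0.5


def fuse_scores(text_score, panel_scores, text_weight=0.6):
    """Late fusion: weighted average of the text and image scores.

    text_weight is a hypothetical tuning parameter, not a published value.
    """
    image_score = document_image_score(panel_scores)
    return text_weight * text_score + (1.0 - text_weight) * image_score


def is_relevant(text_score, panel_scores, threshold=0.5, text_weight=0.6):
    """Label a document relevant when the fused score reaches the threshold."""
    return fuse_scores(text_score, panel_scores, text_weight) >= threshold


if __name__ == "__main__":
    # Text alone is below threshold, but two informative panels raise the score.
    print(is_relevant(0.4, [0.9, 0.7]))
    # The same text score without any figures stays below threshold.
    print(is_relevant(0.4, []))
```

In this toy setting, a document whose text score alone falls below the threshold can still be labeled relevant when its figure panels score highly, which is the kind of effect the premise above anticipates. A real system would more likely learn the combination from curated training data than fix the weight by hand.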

Public Health Relevance

Published biomedical literature forms a vast information source for biomedical scientists and physicians; both treatment decisions and research toward biomedical discovery are based on such information. The proposed research aims to support and speed up the search for information while improving effective access to the most relevant parts of the biomedical literature, by developing new methods and tools that take advantage of the highly informative image data within publications. The successful outcome of this research will lead to the development of well-targeted, effective tools for finding information pertinent to biological phenomena and medical needs, thus expediting focused biomedical discovery, including a better understanding of the role of gene mutations in disease mechanisms, uncovering interactions among proteins, and revealing potential new drugs and drug targets.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM012527-03
Application #: 9762175
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Vanbiervliet, Alan

Project Start: 2017-09-14
Project End: 2021-08-31
Budget Start: 2019-09-01
Budget End: 2020-08-31
Support Year: 3
Fiscal Year: 2019
Total Cost
Indirect Cost

Institution

Name: University of Delaware
Department: Biostatistics & Other Math Sci
Type: Biomed Engr/Col Engr/Engr Sta
DUNS #: 059007500

City: Newark
State: DE
Country: United States
Zip Code: 19716

Related projects


NIH 2020 R01 LM	Incorporating Image-based Features into Biomedical Document Classification Shatkay, Hagit; Marai, Georgeta-Elisabeta / University of Delaware
NIH 2020 R01 LM	Uncovering Clinical Evidence in COVID-19 Publications: An Integrated Search via Text & Images Shatkay, Hagit; Marai, Georgeta-Elisabeta / University of Delaware
NIH 2019 R01 LM	Incorporating Image-based Features into Biomedical Document Classification Shatkay, Hagit; Marai, Georgeta-Elisabeta / University of Delaware
NIH 2018 R01 LM	Incorporating Image-based Features into Biomedical Document Classification Shatkay, Hagit; Marai, Georgeta-Elisabeta / University of Delaware
NIH 2017 R01 LM	Incorporating Image-based Features into Biomedical Document Classification Shatkay, Hagit; Marai, Georgeta-Elisabeta / University of Delaware

Publications

Li, Pengyuan; Jiang, Xiangying; Kambhamettu, Chandra et al. (2018) Compound image segmentation of published biomedical figures. Bioinformatics 34:1192-1199

Elhalawani, Hesham; Lin, Timothy A; Volpe, Stefania et al. (2018) Machine Learning Applications in Head and Neck Radiation Oncology: Lessons From Open-Source Radiomics Challenges. Front Oncol 8:294
