The rapid growth of the biomedical literature and the expansion in disciplinary biomedical research, heralded by high-throughput genome sciences and technologies, have overwhelmed scientists who attempt to assimilate the information necessary for their research. The widespread adoption of title/abstract word searches, such as the National Library of Medicine's highly desirable PubMed system, has provided the first major advance in the way bioscientists find relevant publications since the origin of Index Medicus in 1879 (Hunter and Cohen 2006). The importance of developing valid information retrieval systems for bioscientists has led to the development of information systems worldwide (e.g., Arrowsmith (Smalheiser and Swanson 1998), BioText (Hearst 2003), GeneWays (Friedman et al. 2001; Rzhetsky et al. 2004), iHOP (Hoffmann and Valencia 2005), and BioMedQA (Lee et al. 2006a)) and of annotated databases (e.g., SWISSPROT, OMIM (Hamosh et al. 2005), and BIND (Alfarano et al. 2005)). However, most information systems target only textual information and fail to provide access to other important data such as images (e.g., figures). More than any other documentation, figures usually represent the "evidence" of discovery in the biomedical literature. Full-text biological articles nearly always incorporate figures/images, which are crucial content of the biomedical literature. Our examination of biological articles in the Proceedings of the National Academy of Sciences (PNAS) found an average of 5.2 images per article (Yu and Lee 2006a). Biologists need to access image data to validate research facts and to formulate or test novel research hypotheses. It has been reported that textual statements in the literature are frequently noisy (i.e., they contain "false facts") (Krauthammer et al. 2002).
Capturing images that are experimental "evidence" to support the textual "fact" will benefit bioscience information systems, databases, and bioscientists. Unfortunately, this wealth of information remains virtually inaccessible without automatic systems to organize these images. We propose the development of advanced natural language processing (NLP) tools to semantically organize images. We hypothesize that the text associated with an image semantically entails the image content, and that natural language processing techniques can be developed to accurately associate text with the corresponding images. Furthermore, we hypothesize that images can be semantically organized by categories specified by a standard biological ontology, and that natural language processing approaches can accurately assign the ontological categories to images.
Our specific aims are:
Aim 1: To develop and evaluate NLP techniques for identifying textual statements that correspond to images in full-text articles. We will develop different approaches for two types of association. We will first propose rule-based and statistical approaches to identify the associated text that appears in the full-text articles. We will then develop hybrid approaches to link sentences in abstracts to images in the body of the articles.
Aim 2: To develop and evaluate NLP techniques for automatic classification of experimental results into categories (e.g., Western-Blot, PCR verification, etc.) specified in the experimental protocol resource Protocol-Online. We will explore the use of dictionary-based, rule-based, image classification, and machine-learning approaches for accomplishing this aim.
Aim 3: To develop and evaluate NLP techniques for automatic assignment of Gene Ontology categories to experiments, which will provide a knowledge-based organization of experiments according to biological properties (e.g., catalytic activity). We will develop statistical and machine-learning approaches for accomplishing this aim.
We found that most of the images that appear in full-text biological articles are figure images (Yu and Lee 2006a), and we therefore focus only on figure images in this proposal. The deliverable of Specific Aim 1 will be an effective user interface, BioEx, from which bioscientists can access images directly from sentences in the abstracts. BioEx has the promise of improvement over the traditional single-document-per-article format that has dominated bioscience publications since the first scientific article appeared in 1665 (Gross 2002). The deliverables of Specific Aims 2 and 3 will be open-source algorithms and tools that accurately map images to categories specified by the Gene Ontology and Protocol-Online. Those algorithms and tools will enhance bioscience information retrieval, information extraction, summarization, and question answering.
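As an illustration of the kind of rule-based association named in Aim 1, the following minimal Python sketch links sentences in an article body to figures through explicit mentions such as "Fig. 2A". The regular expression, the function name link_sentences_to_figures, and the sample sentences are assumptions made for this example only; they are not the rules or data used in the project.

```python
import re
from collections import defaultdict

# Matches explicit mentions such as "Fig. 2", "Figure 3A", or "Figures 1 and 3".
# This pattern is an illustrative assumption, not the project's actual rule set.
FIGURE_MENTION = re.compile(
    r"\bfig(?:ure)?s?\.?\s*(\d+)[A-Za-z]?(?:\s*(?:,|and|&)\s*(\d+)[A-Za-z]?)?",
    re.IGNORECASE,
)


def link_sentences_to_figures(sentences):
    """Map each figure number to the sentences that mention it explicitly."""
    links = defaultdict(list)
    for sentence in sentences:
        for match in FIGURE_MENTION.finditer(sentence):
            # Each match captures one or two figure numbers.
            for number in match.groups():
                if number is None:
                    continue
                figure_id = int(number)
                # Record a sentence once per figure, even if mentioned twice.
                if sentence not in links[figure_id]:
                    links[figure_id].append(sentence)
    return dict(links)


if __name__ == "__main__":
    example = [
        "Expression increased after treatment (Fig. 2A).",
        "Figures 1 and 3 show the controls.",
        "No figure is cited here.",
    ]
    for figure_id, linked in sorted(link_sentences_to_figures(example).items()):
        print(figure_id, linked)
```

A rule of this kind only captures sentences that cite a figure explicitly. Abstract sentences rarely do so, which is why Aim 1 also calls for statistical and hybrid approaches.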

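The dictionary-based approach named in Aim 2 can be sketched in a similar way: a figure caption is assigned to every experiment category whose keywords it contains. The keyword lists below are hypothetical and cover only the two example categories mentioned in the aim; a real dictionary would presumably be derived from the Protocol-Online categories.

```python
import re

# Hypothetical keyword dictionary for illustration only; the category names
# come from the examples in Aim 2, the keywords are assumptions.
CATEGORY_KEYWORDS = {
    "Western-Blot": ["western blot", "immunoblot"],
    "PCR": ["pcr", "polymerase chain reaction"],
}


def classify_caption(caption):
    """Return the sorted categories whose keywords appear in the caption."""
    text = caption.lower()
    matched = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            # A leading word boundary avoids matches in the middle of a longer
            # token while still allowing suffixes such as "blotting".
            if re.search(r"\b" + re.escape(keyword), text):
                matched.append(category)
                break
    return sorted(matched)


if __name__ == "__main__":
    print(classify_caption("Western blot analysis of p53 expression."))
    print(classify_caption("RT-PCR and immunoblot confirmation of knockdown."))
    print(classify_caption("Confocal image of stained cells."))
```

A lookup of this kind is easy to inspect but misses captions that describe an experiment without naming the technique, which is presumably where the rule-based, image classification, and machine-learning approaches listed in the aim would come in.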
Agency
National Institutes of Health (NIH)
Institute
National Center for Research Resources (NCRR)
Type
Exploratory/Developmental Grants (R21)
Project #
5R21RR024933-02
Application #
7534822
Study Section
Special Emphasis Panel (ZLM1-ZH-H (M3))
Program Officer
Brazhnik, Olga
Project Start
2007-12-01
Project End
2010-11-30
Budget Start
2008-12-01
Budget End
2010-11-30
Support Year
2
Fiscal Year
2009
Total Cost
$179,517
Indirect Cost
Name
University of Wisconsin Milwaukee
Department
Other Health Professions
Type
Schools of Allied Health Profes
DUNS #
627906399
City
Milwaukee
State
WI
Country
United States
Zip Code
53201
Bockhorst, Joseph P; Conroy, John M; Agarwal, Shashank et al. (2012) Beyond captions: linking figures with abstract sentences in biomedical articles. PLoS One 7:e39618
Kim, Daehyun; Yu, Hong (2011) Figure text extraction in biomedical literature. PLoS One 6:e15338
Agarwal, Shashank; Yu, Hong (2011) Figure summarizer browser extensions for PubMed Central. Bioinformatics 27:1723-4
Zhang, Qing; Cao, Yong-Gang; Yu, Hong (2011) Parsing citations in biomedical articles using conditional random fields. Comput Biol Med 41:190-4
Yu, Hong; Liu, Feifan; Ramesh, Balaji Polepalli (2010) Automatic figure ranking and user interfacing for intelligent figure search. PLoS One 5:e12983
Cao, Yonggang; Li, Zuofeng; Liu, Feifan et al. (2010) An IR-aided machine learning framework for the BioCreative II.5 Challenge. IEEE/ACM Trans Comput Biol Bioinform 7:454-61
Li, Zuofeng; Liu, Feifan; Antieau, Lamont et al. (2010) Lancet: a high precision medication event extraction system for clinical text. J Am Med Inform Assoc 17:563-7
Agarwal, Shashank; Yu, Hong (2009) FigSum: automatically generating structured text summaries for figures in biomedical literature. AMIA Annu Symp Proc 2009:6-10
Kim, Daehyun; Yu, Hong (2009) Hierarchical image classification in the bioscience literature. AMIA Annu Symp Proc 2009:327-31
Agarwal, Shashank; Yu, Hong (2009) Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion. Bioinformatics 25:3174-80

Showing the most recent 10 out of 11 publications