Towards the Building of a Comprehensive Searchable Biological Experiment Database

Yu, Hong

Abstract

The rapid growth of the biomedical literature and the expansion of disciplinary biomedical research, heralded by high-throughput genome sciences and technologies, have overwhelmed scientists who attempt to assimilate the information necessary for their research. The widespread adoption of title/abstract word searches, such as the highly desirable PubMed system of the National Library of Medicine, has provided the first major advance in the way bioscientists find relevant publications since the origin of Index Medicus in 1879 (Hunter and Cohen 2006). The importance of developing valid information retrieval systems for bioscientists has led to the development of information systems worldwide (e.g., Arrowsmith (Smalheiser and Swanson 1998), BioText (Hearst 2003), GeneWays (Friedman et al. 2001; Rzhetsky et al. 2004), iHOP (Hoffmann and Valencia 2005), and BioMedQA (Lee et al. 2006a)) and of annotated databases (e.g., SWISSPROT, OMIM (Hamosh et al. 2005), and BIND (Alfarano et al. 2005)).

However, most information systems target only text and fail to provide access to other important data such as images (e.g., figures). More than any other documentation, figures usually represent the "evidence" of discovery in the biomedical literature. Full-text biological articles nearly always incorporate figures/images, which are crucial content of the biomedical literature. Our examination of biological articles in the Proceedings of the National Academy of Sciences (PNAS) revealed an average of 5.2 images per article (Yu and Lee 2006a). Biologists need to access image data to validate research facts and to formulate or test novel research hypotheses. Textual statements reported in the literature have been found to be frequently noisy (i.e., to contain "false facts") (Krauthammer et al. 2002). Capturing images that are experimental "evidence" supporting the textual "fact" will benefit bioscience information systems, databases, and bioscientists.

Unfortunately, this wealth of information remains virtually inaccessible without automatic systems to organize these images. We propose the development of advanced natural language processing (NLP) tools to semantically organize images. We hypothesize that the text associated with an image semantically entails the image content, and that NLP techniques can be developed to accurately associate text with the corresponding images. Furthermore, we hypothesize that images can be semantically organized by categories specified by standard biological ontologies, and that NLP approaches can accurately assign those ontological categories to images.

Our specific aims are:

Aim 1: To develop and evaluate NLP techniques for identifying textual statements that correspond to images in full-text articles. We will develop different approaches for two types of association. We will first propose rule-based and statistical approaches to identify the associated text that appears in the full-text articles. We will then develop hybrid approaches to link sentences in abstracts to images in the body of the articles.

Aim 2: To develop and evaluate NLP techniques for automatic classification of experimental results into categories (e.g., Western-Blot, PCR verification, etc.) specified in the experimental protocol resource Protocol-Online. We will explore the use of dictionary-based, rule-based, image classification, and machine-learning approaches for accomplishing this aim.

Aim 3: To develop and evaluate NLP techniques for automatic assignment of Gene Ontology categories to experiments, which will provide a knowledge-based organization of experiments according to biological properties (e.g., catalytic activity). We will develop statistical and machine-learning approaches for accomplishing this aim.

We found that most of the images that appear in full-text biological articles are figure images (Yu and Lee 2006a), and we therefore focus only on figure images in this proposal. The deliverable of Specific Aim 1 will be an effective user interface, BioEx, through which bioscientists can access images directly from sentences in the abstracts. BioEx promises an improvement over the traditional single-document-per-article format that has dominated bioscience publications since the first scientific article appeared in 1665 (Gross 2002). The deliverables of Specific Aims 2 and 3 will be open-source algorithms and tools that accurately map images to categories specified by the Gene Ontology and Protocol-Online. Those algorithms and tools will enhance bioscience information retrieval, information extraction, summarization, and question answering.
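As a rough illustration of what the simplest rule-based and dictionary-based baselines named in Aims 1 and 2 might look like, the Python sketch below finds sentences that explicitly cite a given figure and assigns experiment categories by keyword lookup. It is not the project's method: the function names, the regular expression, and the keyword lists are all hypothetical, chosen only to echo the category examples given in the abstract.

```python
import re

# Hypothetical keyword dictionary. The category names echo the examples in
# Aim 2; the keywords are illustrative assumptions, not a curated lexicon.
CATEGORY_KEYWORDS = {
    "Western-Blot": ["western blot", "immunoblot"],
    "PCR": ["pcr", "polymerase chain reaction"],
}

# Assumed citation pattern: "Fig. 2A", "Figure 3", "fig. 10", and so on.
FIGURE_MENTION = re.compile(r"\b(?:Fig\.|Figure)\s*(\d+)", re.IGNORECASE)


def sentences_for_figure(sentences, figure_number):
    """Rule-based association: keep sentences that explicitly cite the figure."""
    wanted = str(figure_number)
    return [s for s in sentences if wanted in FIGURE_MENTION.findall(s)]


def classify_text(text):
    """Dictionary-based classification: categories whose keywords occur in text."""
    lowered = text.lower()
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    )


if __name__ == "__main__":
    example = [
        "The protein was detected by Western blot (Fig. 2A).",
        "Expression was confirmed by RT-PCR (Figure 3).",
        "Cells were grown overnight.",
    ]
    for sentence in sentences_for_figure(example, 2):
        print(sentence, classify_text(sentence))
```

Plain substring matching of this kind is deliberately naive (for example, "pcr" would match inside unrelated tokens); the statistical and machine-learning approaches mentioned in the aims would be expected to replace or complement such rules.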

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Center for Research Resources (NCRR)
Type: Exploratory/Developmental Grants (R21)
Project #: 1R21RR024933-01A1
Application #: 7314689
Study Section: Special Emphasis Panel (ZLM1-ZH-H (M3))
Program Officer: Brazhnik, Olga

Project Start: 2007-12-01
Project End: 2009-11-30
Budget Start: 2007-12-01
Budget End: 2008-11-30
Support Year: 1
Fiscal Year: 2008
Total Cost: $230,085
Indirect Cost

Institution

Name: University of Wisconsin Milwaukee
Department: Other Health Professions
Type: Schools of Allied Health Profes
DUNS #: 627906399

City: Milwaukee
State: WI
Country: United States
Zip Code: 53201

Related projects


NIH 2009 R21 RR	Towards the Building of a Comprehensive Searchable Biological Experiment Database Yu, Hong / University of Wisconsin Milwaukee	$179,517
NIH 2008 R21 RR	Towards the Building of a Comprehensive Searchable Biological Experiment Database Yu, Hong / University of Wisconsin Milwaukee	$230,085

Publications

Bockhorst, Joseph P; Conroy, John M; Agarwal, Shashank et al. (2012) Beyond captions: linking figures with abstract sentences in biomedical articles. PLoS One 7:e39618

Kim, Daehyun; Yu, Hong (2011) Figure text extraction in biomedical literature. PLoS One 6:e15338

Agarwal, Shashank; Yu, Hong (2011) Figure summarizer browser extensions for PubMed Central. Bioinformatics 27:1723-4

Zhang, Qing; Cao, Yong-Gang; Yu, Hong (2011) Parsing citations in biomedical articles using conditional random fields. Comput Biol Med 41:190-4

Yu, Hong; Liu, Feifan; Ramesh, Balaji Polepalli (2010) Automatic figure ranking and user interfacing for intelligent figure search. PLoS One 5:e12983

Cao, Yonggang; Li, Zuofeng; Liu, Feifan et al. (2010) An IR-aided machine learning framework for the BioCreative II.5 Challenge. IEEE/ACM Trans Comput Biol Bioinform 7:454-61

Li, Zuofeng; Liu, Feifan; Antieau, Lamont et al. (2010) Lancet: a high precision medication event extraction system for clinical text. J Am Med Inform Assoc 17:563-7

Agarwal, Shashank; Yu, Hong (2009) FigSum: automatically generating structured text summaries for figures in biomedical literature. AMIA Annu Symp Proc 2009:6-10

Kim, Daehyun; Yu, Hong (2009) Hierarchical image classification in the bioscience literature. AMIA Annu Symp Proc 2009:327-31

Agarwal, Shashank; Yu, Hong (2009) Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion. Bioinformatics 25:3174-80

Showing the most recent 10 out of 11 publications
