? ? The rapid growth of the biomedical literature and the expansion in disciplinary biomedical research, heralded by high-throughput genome sciences and technologies, have overwhelmed scientists who attempt to assimilate information necessary for their research. The widespread adoption of title/abstract word searches, such as highly desirable the National Library of Medicine's PubMed system, has provided the first major advance in the way bioscientists find relevant publications since the origin of Index Medicus in 1879 (Hunter and Cohen 2006). The importance of developing valid information retrieval systems for bioscientists has led to the development of information systems worldwide (e.g., Arrowsmith (Smalheiser and Swanson 1998), BioText (Hearst 2003), GeneWays (Friedman et al. 2001; Rzhetsky et al. 2004), iHOP (Hoffmann and Valencia 2005), and BioMedQA (Lee et al. 2006a), and annotated databases (e.g., SWISSPROT, OMIM (Hamosh et al. 2005) and BIND (Alfarano et al. 2005)). ? ? However, most of information systems target only text information and fail to provide access to other important data such as images (e.g., figures). More than any other documentation, figures usually represent the """"""""evidence"""""""" of discovery in the biomedical literature. Full-text biological articles nearly always incorporate figures/images that are the crucial content of the biomedical literature. Our examination of biological articles in the Proceedings of the National Academy of Sciences (PNAS) revealed the occurrence of 5.2 images per article on average (Yu and Lee 2006a). Biologists need to access image data to validate research facts and to formulate or to test novel research hypotheses. It has been evaluated that textual statements reported in literature frequently are noisy (i.e., containing """"""""false facts"""""""") (Krauthammer et al. 2002). Capturing images that are experimental """"""""evidence"""""""" to support the textual """"""""fact"""""""" will benefit bioscience information systems, databases, and bioscientists. ? ? Unfortunately, this wealth of information remains virtually inaccessible without automatic systems to organize these images. We propose the development of advanced natural language processing (NLP) tools to semantically organize images. We hypothesize that text that associated with images semantically entails the image content and natural language processing techniques can be developed to accurately associate the text to their images. Furthermore, we hypothesize that images can be semantically organized by categories specified by standard biological ontology, and that natural language processing approaches can accurately assign the ontological categories to images. ? ? Our specific aims are: ? ? Aim 1: To develop and evaluate NLP techniques for identifying textual statements that correspond to images in full-text articles. We will develop different approaches for two types of the associations. We will first propose rule-based and statistical approaches to identify the associated text that appears in the full-text articles. We will then develop hybrid approaches to link sentences in abstracts to images in the body of the articles. ? ? Aim 2: To develop and evaluate NLP techniques for automatic classification of experimental results into categories (e.g., Western-Blot, PCR verification, etc) specified in the experimental protocol Protocol-Online. ? ? We will explore the use of dictionary-based, rule-based, image classification, and machine-learning approaches for accomplishing this aim. ? ? Aim 3: To develop and evaluate NLP techniques for automatic assignment of Gene Ontology categories to experiments, which will provide a knowledge-based organization of experiments according to biological properties (e.g., catalytic activity). We will develop statistical and machine-learning approaches for accomplishing this aim. ? ? We found that most of the images that appear in full-text biological articles are figure images (Yu and Lee 2006a) and we therefore focus on figure images only in this proposal. The deliverable of Specific Aim 1 will be an effective user-interface BioEx from which bioscientists can access images directly from sentences in the abstracts. BioEx has the promise of improvement over the traditional single-document-per-article format that has dominated bioscience publications since the first scientific article appeared in 1665 (Gross 2002). The deliverables of Specific Aim 2 and 3 will be open-source algorithms and tools that accurately map images to categories specified by the Gene Ontology and the Protocol Online. Those algorithms and tools will enhance bioscience information retrieval, information extraction, summarization, and question answering. ? ? ?
Showing the most recent 10 out of 11 publications