Construction of a Full Text Corpus for Biomedical Text Mining

Hunter, Lawrence

Abstract

There is a demonstrated community need for an annotated corpus consisting of the full texts of biomedical journal articles. There are many reasons to believe that the rate-limiting factor impeding progress in biomedical language processing today is the lack of availability of the right kind of expertly annotated data. An annotated corpus is a collection of texts with information about the meaning or structure associated with particular textual elements. Annotated corpora are a critical component of biomedical natural language processing research in two ways. First, most contemporary approaches to language processing rely at least in part on machine learning or statistical models. Such systems must be "trained" on sets of examples with known outputs, so annotated corpora provide the training data vital to the construction of modern NLP systems. Second, annotated corpora provide the gold standard by which various approaches to particular text mining tasks are evaluated. Due to their central roles in training and testing language processing systems, the quality of the design and operational creation of annotated corpora places fundamental limits on what can be accomplished with such systems. Although there has been valuable work done on annotating abstracts, there are important differences between abstracts and full-text articles from a text mining perspective, and annotation of full-text journal articles has been negligible. Workers in both the biological (especially model organism database curation) community and the text mining community have independently pointed out the importance of processing the full text of scientific publications if the biomedical world is to be able to fully utilize text mining. We propose to build a large, fully annotated corpus consisting of full texts of biomedical journal articles.
Additionally, previous biomedical corpus annotation efforts have often utilized ad hoc ontologies that have limited their utility outside of the groups that created them. We will ensure community acceptability by annotating with respect to community-consensus ontologies such as the Gene Ontology and the UMLS. Since the task involves expensive human labor, efficiency is a key issue in creating corpora. For this reason, we propose to build a team that includes the builder of the largest semantically annotated corpus to date, one of the pioneers of the model organism databases, and an already-assembled cadre of experienced linguistic and domain-expert annotators.
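
The two roles of an annotated corpus described in the abstract (annotations attached to textual elements, and a gold standard for evaluation) can be illustrated with a minimal sketch. It is not taken from the project: the `Annotation` class, the `score` function, the example sentence, and the concept identifiers are all hypothetical placeholders, and exact-match scoring is only one of several possible evaluation conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A standoff annotation: a character span plus a concept label (hypothetical schema)."""
    start: int    # offset of the first character of the span
    end: int      # offset just past the last character of the span
    concept: str  # ontology identifier; the values used below are placeholders


def score(gold, predicted):
    """Return exact-match (precision, recall, F1) of predicted annotations against gold."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative sentence with two gold-standard annotations.
text = "BRCA1 is involved in DNA repair."
gold = [
    Annotation(0, 5, "EXAMPLE:gene"),         # "BRCA1"
    Annotation(21, 31, "EXAMPLE:dna_repair"),  # "DNA repair"
]

# Hypothetical system output: one correct span, one span that is too short.
predicted = [
    Annotation(0, 5, "EXAMPLE:gene"),
    Annotation(21, 24, "EXAMPLE:dna"),         # "DNA"
]

precision, recall, f1 = score(gold, predicted)
```

Keeping annotations as offsets separate from the text leaves the article text unchanged and lets several annotation layers, for example one per ontology, refer to the same document.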

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Resources Project Grant (NLM) (G08)
Project #: 5G08LM009639-03
Application #: 7673720
Study Section: Special Emphasis Panel (ZLM1-ZH-H (M3))
Program Officer: Sim, Hua-Chuan

Project Start: 2007-09-15
Project End: 2010-09-14
Budget Start: 2009-09-15
Budget End: 2010-09-14
Support Year: 3
Fiscal Year: 2009
Total Cost: $142,851
Indirect Cost

Institution

Name: University of Colorado Denver
Department: Pharmacology
Type: Schools of Medicine
DUNS #: 041096314

City: Aurora
State: CO
Country: United States
Zip Code: 80045

Related projects

NIH 2009 G08 LM	Construction of a Full Text Corpus for Biomedical Text Mining	Hunter, Lawrence E. / University of Colorado Denver	$142,851
NIH 2009 G08 LM	Construction of a Full Text Corpus for Biomedical Text Mining	Hunter, Lawrence E. / University of Colorado Denver	$66,015
NIH 2008 G08 LM	Construction of a Full Text Corpus for Biomedical Text Mining	Hunter, Lawrence E. / University of Colorado Denver	$132,030
NIH 2007 G08 LM	Construction of a Full Text Corpus for Biomedical Text Mining	Hunter, Lawrence E. / University of Colorado Denver	$130,432

Publications

Boguslav, Mayla; Cohen, K Bretonnel; Baumgartner, William A et al. (2018) Improving precision in concept normalization. Pac Symp Biocomput 23:566-577

Cohen, K Bretonnel; Lanfranchi, Arrick; Choi, Miji Joo-Young et al. (2017) Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics 18:372

Verspoor, Karin; Cohen, K Bretonnel; Hunter, Lawrence (2009) The textual characteristics of traditional and Open Access scientific journals are similar. BMC Bioinformatics 10:183

Cohen, K Bretonnel; Palmer, Martha; Hunter, Lawrence (2008) Nominalization and alternations in biomedical language. PLoS One 3:e3158

Zweigenbaum, Pierre; Demner-Fushman, Dina; Yu, Hong et al. (2007) Frontiers of biomedical text mining: current progress. Brief Bioinform 8:358-75