There is a demonstrated community need for an annotated corpus consisting of the full texts of biomedical journal articles. There are many reasons to believe that the rate-limiting factor impeding progress in biomedical language processing today is the lack of the right kind of expertly annotated data. An annotated corpus is a collection of texts with information about the meaning or structure associated with particular textual elements. Annotated corpora are a critical component of biomedical natural language processing research in two ways. First, most contemporary approaches to language processing rely at least in part on machine learning or statistical models. Such systems must be "trained" on sets of examples with known outputs, so annotated corpora provide the training data vital to the construction of modern NLP systems. Second, annotated corpora provide the gold standard by which various approaches to particular text mining tasks are evaluated. Because annotated corpora play these central roles in training and testing language processing systems, the quality of their design and construction places fundamental limits on what can be accomplished with such systems. Although valuable work has been done on annotating abstracts, there are important differences between abstracts and full-text articles from a text mining perspective, and annotation of full-text journal articles has been negligible. Workers in both the biological community (especially model organism database curation) and the text mining community have independently pointed out the importance of processing the full text of scientific publications if the biomedical world is to fully utilize text mining. We propose to build a large, fully annotated corpus consisting of full texts of biomedical journal articles.
Additionally, previous biomedical corpus annotation efforts have often utilized ad hoc ontologies that have limited their utility outside of the groups that created them. We will ensure community acceptability by annotating with respect to community-consensus ontologies such as the Gene Ontology and the UMLS. Since the task involves expensive human labor, efficiency is a key issue in creating corpora. For this reason, we propose to build a team that includes the builder of the largest semantically annotated corpus to date, one of the pioneers of the model organism databases, and an already-assembled cadre of experienced linguistic and domain-expert annotators.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Resources Project Grant (NLM) (G08)
Study Section: Special Emphasis Panel (ZLM1-ZH-H (M3))
Program Officer: Sim, Hua-Chuan
University of Colorado Denver
Schools of Medicine
United States
Boguslav, Mayla; Cohen, K Bretonnel; Baumgartner, William A et al. (2018) Improving precision in concept normalization. Pac Symp Biocomput 23:566-577
Cohen, K Bretonnel; Lanfranchi, Arrick; Choi, Miji Joo-Young et al. (2017) Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics 18:372
Verspoor, Karin; Cohen, K Bretonnel; Hunter, Lawrence (2009) The textual characteristics of traditional and Open Access scientific journals are similar. BMC Bioinformatics 10:183
Cohen, K Bretonnel; Palmer, Martha; Hunter, Lawrence (2008) Nominalization and alternations in biomedical language. PLoS One 3:e3158
Zweigenbaum, Pierre; Demner-Fushman, Dina; Yu, Hong et al. (2007) Frontiers of biomedical text mining: current progress. Brief Bioinform 8:358-75