Biomedical language processing, the application of computational techniques to human-generated texts in biomedicine, is an increasingly important enabling technology for basic and applied biomedical research. The exponential growth of the peer-reviewed literature and the breakdown of disciplinary boundaries associated with high-throughput techniques have increased the importance of automated tools for keeping scientists abreast of all of the published material relevant to their work. However, despite decades of research, the performance of state-of-the-art tools for basic language processing tasks like information extraction and document retrieval remains below the level necessary for adequate utility and widespread adoption of this technology. The development, performance and evaluation of text mining systems depend crucially on the availability of appropriate corpora: collections of representative documents that have been annotated with human judgments relevant to a language-processing task. Corpora play two roles in the development of this technology: first, they act as "gold standards" by which alternative automated methods can be fairly compared, and second, they provide data for the training of statistical and machine learning systems that create empirical models of patterns in language use. The conventional view is that corpora are neutral, random samples of the domain of interest. Our preliminary work suggests that the restrictions in size, quality, genre, and representational schema of the small number of existing corpora are themselves a critical limiting factor for near-term breakthroughs in biomedical text processing technology. Therefore, we propose to test the following hypothesis: Creation of large, high-quality, biomedical corpora from multiple genres will lead to significant improvements in the performance of biomedical text mining systems and the creation of new approaches to text mining tasks.
Specific aims include constructing several large corpora covering a range of genres and incorporating a rich knowledge representation; identifying factors that affect differential performance on full text versus abstracts; and developing new methods for language processing, especially of full text. Because improvements in the ability to automatically extract information from many textual genres will assist scientists and clinicians in the crucial task of keeping up with the burgeoning biomedical literature, the potential public health impact is quite large.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM009254-02
Application #: 7287359
Study Section: Special Emphasis Panel (ZLM1-ZH-S (M3))
Program Officer: Sim, Hua-Chuan
Project Start: 2006-09-15
Project End: 2009-09-14
Budget Start: 2007-09-15
Budget End: 2008-09-14
Support Year: 2
Fiscal Year: 2007
Total Cost: $350,638
Indirect Cost:

Name: University of Colorado Denver
Department: Pharmacology
Type: Schools of Medicine
DUNS #: 041096314
City: Aurora
State: CO
Country: United States
Zip Code: 80045

Boguslav, Mayla; Cohen, K Bretonnel; Baumgartner, William A et al. (2018) Improving precision in concept normalization. Pac Symp Biocomput 23:566-577
Cohen, K Bretonnel; Xia, Jingbo; Zweigenbaum, Pierre et al. (2018) Three Dimensions of Reproducibility in Natural Language Processing. LREC Int Conf Lang Resour Eval 2018:156-165
Callahan, Tiffany J; Baumgartner, William A; Bada, Michael et al. (2018) OWL-NETS: Transforming OWL Representations for Improved Network Inference. Pac Symp Biocomput 23:133-144
Cohen, K Bretonnel; Lanfranchi, Arrick; Choi, Miji Joo-Young et al. (2017) Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics 18:372
Kao, David P; Stevens, Laura M; Hinterberg, Michael A et al. (2017) Phenotype-Specific Association of Single-Nucleotide Polymorphisms with Heart Failure and Preserved Ejection Fraction: a Genome-Wide Association Analysis of the Cardiovascular Health Study. J Cardiovasc Transl Res 10:285-294
Hooper, Joan E; Feng, Weiguo; Li, Hong et al. (2017) Systems biology of facial development: contributions of ectoderm and mesenchyme. Dev Biol 426:97-114
Cohen, K Bretonnel; Fort, Karën; Adda, Gilles et al. (2016) Ethical Issues in Corpus Linguistics And Annotation: Pay Per Hit Does Not Affect Effective Hourly Rate For Linguistic Resource Development On Amazon Mechanical Turk. LREC Int Conf Lang Resour Eval 2016:8-12
Cohen, K Bretonnel; Xia, Jingbo; Roeder, Christophe et al. (2016) Reproducibility in Natural Language Processing: A Case Study of Two R Libraries for Mining PubMed/MEDLINE. LREC Int Conf Lang Resour Eval 2016:6-12
Cohen, K Bretonnel; Baumgartner Jr, William A; Temnikova, Irina (2016) SuperCAT: The (New and Improved) Corpus Analysis Toolkit. LREC Int Conf Lang Resour Eval 2016:2784-2788
Eberlein, Jens; Davenport, Bennett; Nguyen, Tom et al. (2016) Aging promotes acquisition of naive-like CD8+ memory T cell traits and enhanced functionalities. J Clin Invest 126:3942-3960

Showing the most recent 10 out of 54 publications