Recent developments in text mining research, and in scientific publication, have brought us to the moment when the long-standing potential of natural language processing technology to benefit biomedical researchers may finally be realized. Technological advances, recent results in computational linguistics, the maturation of biomedical ontologies, and the advent of resources such as PubMedCentral have set the stage for an attempt at an integrated computational analysis of a large proportion of the full-text biomedical literature. Such an analysis has the potential to dramatically extend the ways in which biomedical researchers can use the scientific literature, particularly in the analysis of genome-scale datasets, broadly accelerating scientific discovery and increasing its efficiency. We hypothesize that it is now possible to extract a wide variety of ontologically grounded entities and relationships by processing the entire PubMedCentral document collection accurately and with good coverage, to use this extracted information to produce new genres of scientifically valuable tools and analysis techniques, and to demonstrate its utility in the analysis of genome-scale data. The challenges that we plan to overcome range from fundamental linguistic issues (e.g., cross-document coreference resolution), to high-performance computing (e.g., scaling up integrated processing to include millions of complex documents), to fielding practical systems that can exploit enormous knowledge bases to accelerate the analysis of very large molecular datasets.

Public Health Relevance

Enormous amounts of biomedical information are now available in the PubMedCentral database, but computers cannot work with it because it is in the form of human-language text, and humans cannot read it all because of its sheer volume. The goal of this project is to harvest large amounts of that information automatically, making it available to humans in summarized form and to computers in computer-readable form.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Special Emphasis Panel (ZLM1-ZH-C (01))
Program Officer: Sim, Hua-Chuan
University of Colorado Denver
Schools of Medicine
United States
Boguslav, Mayla; Cohen, K Bretonnel; Baumgartner, William A et al. (2018) Improving precision in concept normalization. Pac Symp Biocomput 23:566-577
Cohen, K Bretonnel; Xia, Jingbo; Zweigenbaum, Pierre et al. (2018) Three Dimensions of Reproducibility in Natural Language Processing. LREC Int Conf Lang Resour Eval 2018:156-165
Callahan, Tiffany J; Baumgartner, William A; Bada, Michael et al. (2018) OWL-NETS: Transforming OWL Representations for Improved Network Inference. Pac Symp Biocomput 23:133-144
Cohen, K Bretonnel; Lanfranchi, Arrick; Choi, Miji Joo-Young et al. (2017) Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics 18:372
Kao, David P; Stevens, Laura M; Hinterberg, Michael A et al. (2017) Phenotype-Specific Association of Single-Nucleotide Polymorphisms with Heart Failure and Preserved Ejection Fraction: a Genome-Wide Association Analysis of the Cardiovascular Health Study. J Cardiovasc Transl Res 10:285-294
Hooper, Joan E; Feng, Weiguo; Li, Hong et al. (2017) Systems biology of facial development: contributions of ectoderm and mesenchyme. Dev Biol 426:97-114
Névéol, Aurélie; Cohen, K Bretonnel; Grouin, Cyril et al. (2016) Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016. CEUR Workshop Proc 1609:28-42
Cohen, K Bretonnel; Fort, Karën; Adda, Gilles et al. (2016) Ethical Issues in Corpus Linguistics And Annotation: Pay Per Hit Does Not Affect Effective Hourly Rate For Linguistic Resource Development On Amazon Mechanical Turk. LREC Int Conf Lang Resour Eval 2016:8-12
Cohen, K Bretonnel; Xia, Jingbo; Roeder, Christophe et al. (2016) Reproducibility in Natural Language Processing: A Case Study of Two R Libraries for Mining PubMed/MEDLINE. LREC Int Conf Lang Resour Eval 2016:6-12
Cohen, K Bretonnel; Baumgartner Jr, William A; Temnikova, Irina (2016) SuperCAT: The (New and Improved) Corpus Analysis Toolkit. LREC Int Conf Lang Resour Eval 2016:2784-2788
