Beyond Abstracts: Issues in Mining Full Texts

Hunter, Lawrence

Abstract

Biomedical language processing, the application of computational techniques to human-generated texts in biomedicine, is an increasingly important enabling technology for basic and applied biomedical research. The exponential growth of the peer-reviewed literature and the breakdown of disciplinary boundaries associated with high-throughput techniques have increased the importance of automated tools for keeping scientists abreast of all of the published material relevant to their work. However, despite decades of research, the performance of state-of-the-art tools for basic language processing tasks like information extraction and document retrieval remains below the level necessary for adequate utility and widespread adoption of this technology. The development, performance, and evaluation of text mining systems depend crucially on the availability of appropriate corpora: collections of representative documents that have been annotated with human judgments relevant to a language-processing task. Corpora play two roles in the development of this technology: first, they act as "gold standards" by which alternative automated methods can be fairly compared, and second, they provide data for the training of statistical and machine learning systems that create empirical models of patterns in language use. The conventional view is that corpora are neutral, random samples of the domain of interest. Our preliminary work suggests that the restrictions in size, quality, genre, and representational schema of the small number of existing corpora are themselves a critical limiting factor for near-term breakthroughs in biomedical text processing technology. Therefore, we propose to test the following hypothesis: Creation of large, high-quality biomedical corpora from multiple genres will lead to significant improvements in the performance of biomedical text mining systems and the creation of new approaches to text mining tasks.
Specific aims include constructing several large corpora covering a range of genres and incorporating a rich knowledge representation; identifying factors that affect differential performance on full text versus abstracts; and developing new methods for language processing, especially of full text. Because improvements in the ability to automatically extract information from many textual genres will assist scientists and clinicians in the crucial task of keeping up with the burgeoning biomedical literature, the potential public health impact is quite large.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM009254-02
Application #: 7287359
Study Section: Special Emphasis Panel (ZLM1-ZH-S (M3))
Program Officer: Sim, Hua-Chuan

Project Start: 2006-09-15
Project End: 2009-09-14
Budget Start: 2007-09-15
Budget End: 2008-09-14
Support Year: 2
Fiscal Year: 2007
Total Cost: $350,638
Indirect Cost

Institution

Name: University of Colorado Denver
Department: Pharmacology
Type: Schools of Medicine
DUNS #: 041096314

City: Aurora
State: CO
Country: United States
Zip Code: 80045

Publications

Boguslav, Mayla; Cohen, K Bretonnel; Baumgartner, William A et al. (2018) Improving precision in concept normalization. Pac Symp Biocomput 23:566-577

Cohen, K Bretonnel; Xia, Jingbo; Zweigenbaum, Pierre et al. (2018) Three Dimensions of Reproducibility in Natural Language Processing. LREC Int Conf Lang Resour Eval 2018:156-165

Callahan, Tiffany J; Baumgartner, William A; Bada, Michael et al. (2018) OWL-NETS: Transforming OWL Representations for Improved Network Inference. Pac Symp Biocomput 23:133-144

Cohen, K Bretonnel; Lanfranchi, Arrick; Choi, Miji Joo-Young et al. (2017) Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics 18:372

Kao, David P; Stevens, Laura M; Hinterberg, Michael A et al. (2017) Phenotype-Specific Association of Single-Nucleotide Polymorphisms with Heart Failure and Preserved Ejection Fraction: a Genome-Wide Association Analysis of the Cardiovascular Health Study. J Cardiovasc Transl Res 10:285-294

Hooper, Joan E; Feng, Weiguo; Li, Hong et al. (2017) Systems biology of facial development: contributions of ectoderm and mesenchyme. Dev Biol 426:97-114

Cohen, K Bretonnel; Fort, Karën; Adda, Gilles et al. (2016) Ethical Issues in Corpus Linguistics And Annotation: Pay Per Hit Does Not Affect Effective Hourly Rate For Linguistic Resource Development On Amazon Mechanical Turk. LREC Int Conf Lang Resour Eval 2016:8-12

Cohen, K Bretonnel; Xia, Jingbo; Roeder, Christophe et al. (2016) Reproducibility in Natural Language Processing: A Case Study of Two R Libraries for Mining PubMed/MEDLINE. LREC Int Conf Lang Resour Eval 2016:6-12

Cohen, K Bretonnel; Baumgartner Jr, William A; Temnikova, Irina (2016) SuperCAT: The (New and Improved) Corpus Analysis Toolkit. LREC Int Conf Lang Resour Eval 2016:2784-2788

Eberlein, Jens; Davenport, Bennett; Nguyen, Tom et al. (2016) Aging promotes acquisition of naive-like CD8+ memory T cell traits and enhanced functionalities. J Clin Invest 126:3942-3960

Showing the most recent 10 out of 54 publications
