This project focuses on clinical natural language processing (cNLP), a field of emerging importance in informatics. Starting with the Linguistic String Project's Medical Language Processor (New York University) in the 1970s, researchers have made steady gains in cNLP through empirical studies and by building sophisticated high-level cNLP software applications (e.g., Columbia's MedLEE). There are no fewer than four scientific conferences devoted exclusively to biomedical/clinical NLP. The cNLP literature has been growing over the past decade, and this growth will gain momentum as more clinical text repositories are released, such as the MIMIC II and University of Pittsburgh BLU Lab corpora.

However, sustained success in the field of cNLP is hampered by the reality that clinical texts have far more noise than do texts traditionally studied in NLP, such as newswire articles, biomedical abstracts, and discharge summaries. Noise in this context is defined by the parseability characteristics of the language and the linguistic structures that appear in text. Clinical texts come in a striking variety of note types, with the best studied types being discharge summaries, radiology reports, and pathology reports. These note types share an important feature: they are written to communicate care issues between healthcare providers and hence are typically well-composed, well-edited, and often dictated. But the vast majority of notes in the electronic health record are written primarily to document care issues. They communicate as well, of course, but much less care is used in their creation than with discharge summaries and reports. As a result they are often ungrammatical; are composed of short, telegraphic phrases; are replete with misspellings and shorthand (e.g., abbreviations); are ill-formatted with templates and liberal use of white space; and are embedded with "non-prose" (e.g., strings of laboratory values).
All of these sources of noise complicate otherwise straightforward NLP tasks like tokenization, sentence segmentation, and ultimately information extraction itself. We propose a systematic study of ways to increase the signal-to-noise ratio in clinical narratives to improve cNLP. This work extends our preliminary research (under the POET project) and has the following aims:
o Develop and implement a suite of parseability improvement tools designed for all clinical note types from multiple healthcare institutions.
o Evaluate the empirical and the functional success of the parseability improvement tools.
o Design and implement a HIPAA-compliant, UIMA-based pipeline cNLP framework for use in a typical high-performance, multi-processor computing environment.
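
To make the notion of "non-prose" concrete, the following is a minimal sketch of one possible heuristic for separating narrative lines from lines dominated by laboratory values. It is an illustration only and not the POET project's method: the function names (non_prose_ratio, split_prose), the digit-based rule, the 0.5 threshold, and the sample note are all assumptions invented for this example.

```python
import re

# Hypothetical heuristic (an assumption, not the project's actual tooling):
# a token is treated as "non-prose" if it contains at least one digit,
# which tends to catch lab values, doses, and dates.
TOKEN_RE = re.compile(r"\S+")


def non_prose_ratio(line: str) -> float:
    """Return the fraction of whitespace-delimited tokens containing a digit."""
    tokens = TOKEN_RE.findall(line)
    if not tokens:
        return 0.0
    numeric = sum(1 for tok in tokens if any(ch.isdigit() for ch in tok))
    return numeric / len(tokens)


def split_prose(note: str, threshold: float = 0.5):
    """Partition a note's non-empty lines into (prose, non_prose) lists.

    The threshold is an arbitrary illustrative value, not a tuned parameter.
    """
    prose, non_prose = [], []
    for raw in note.splitlines():
        line = raw.strip()
        if not line:
            continue
        if non_prose_ratio(line) >= threshold:
            non_prose.append(line)
        else:
            prose.append(line)
    return prose, non_prose


# Made-up sample text in the telegraphic style described above.
note = """Pt c/o SOB x3 days, denies CP.
Na 138 K 4.1 Cl 102 CO2 24 BUN 18 Cr 0.9
Will cont current meds and f/u in clinic."""

prose, non_prose = split_prose(note)
```

A filter of this kind would run before sentence segmentation, so that strings of laboratory values are routed to a separate handler instead of being parsed as sentences. A realistic tool would need far richer features than digit counts, since abbreviations and templates also defeat simple rules.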

Public Health Relevance

The multi-billion-dollar investment in electronic health records (EHRs) under the ARRA shows that mining clinical data electronically will continue to be essential to informatics research. Most data in the EHR resides as unstructured text, and POET2 provides a means to unlock that data by combining a new, HIPAA-compliant high-performance computing architecture with sophisticated text preprocessing.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Special Emphasis Panel (ZLM1-ZH-C (01))
Program Officer
Vanbiervliet, Alan
University of Utah
Schools of Medicine
Salt Lake City
United States
Bui, Duy Duc An; Del Fiol, Guilherme; Hurdle, John F et al. (2016) Extractive text summarization system to aid data extraction from full text in systematic review development. J Biomed Inform 64:265-272
Kim, Youngjun; Riloff, Ellen; Hurdle, John F (2015) A Study of Concept Extraction Across Different Types of Clinical Notes. AMIA Annu Symp Proc 2015:737-46
Doing-Harris, Kristina M; Weir, Charlene R; Igo, Sean et al. (2015) POETenceph - Automatic identification of clinical notes indicating encephalopathy using a realist ontology. AMIA Annu Symp Proc 2015:512-21
Jones, David E; Igo, Sean; Hurdle, John et al. (2014) Automatic extraction of nanoparticle properties using natural language processing: NanoSifter an application to acquire PAMAM dendrimer properties. PLoS One 9:e83932
Bradford, Wayne; Hurdle, John F; LaSalle, Bernie et al. (2014) Development of a HIPAA-compliant environment for translational research data and analytics. J Am Med Inform Assoc 21:185-9
Doing-Harris, Kristina; Patterson, Olga; Igo, Sean et al. (2013) Document Sublanguage Clustering to Detect Medical Specialty in Cross-institutional Clinical Texts. Proc ACM Int Workshop Data Text Min Biomed Inform 2013:9-12
Pestian, John P; Matykiewicz, Pawel; Linn-Gust, Michelle et al. (2012) Sentiment Analysis of Suicide Notes: A Shared Task. Biomed Inform Insights 5:3-16
Workman, T Elizabeth; Fiszman, Marcelo; Hurdle, John F (2012) Text summarization as a decision support aid. BMC Med Inform Decis Mak 12:41
Dorr, David A; Cohen, Aaron M; Williams, Marsha Pierre-Jacques et al. (2011) From simply inaccurate to complex and inaccurate: complexity in standards-based quality measures. AMIA Annu Symp Proc 2011:331-8
Kim, Youngjun; Hurdle, John; Meystre, Stéphane M (2011) Using UMLS lexical resources to disambiguate abbreviations in clinical text. AMIA Annu Symp Proc 2011:715-22