This project focuses on clinical natural language processing (cNLP), a field of emerging importance in informatics. Starting with the Linguistic String Project's Medical Language Processor (New York University) in the 1970s, researchers have made steady gains in cNLP through empirical studies and by building sophisticated high-level cNLP software applications (e.g., Columbia's MedLEE). There are no fewer than four scientific conferences devoted exclusively to biomedical/clinical NLP. The cNLP literature has been growing over the past decade, and this will gain momentum as more clinical text repositories are released, such as the MIMIC II and University of Pittsburgh BLU Lab corpora. However, sustained success in the field of cNLP is hampered by the reality that clinical texts have a far more noise than do texts traditionally studied in NLP, such as newswire articles, biomedical abstracts, and discharge summaries. Noise in this context is defined by the parseability characteristics of the language and the linguistic structures that appear in text. Clinical texts come in a striking variety of note types, with the best studied types being discharge summaries, radiology reports, and pathology reports. These note types share an important feature: they are written to communicate care issues between healthcare providers and hence typically are well-composed, well-edited, and often are dictated. But the vast majority of notes in the electronic health record are written primarily to document care issues. They communicate as well, of course, but much less care is used in their creation than with discharge summaries and reports. As a result they are often ungrammatical;are composed of short, telegraphic phrases;are replete with misspellings and shorthand (e.g., abbreviations);are ill-formatted with templates and liberal use of white space;and are embedded with "non-prose" (e.g., strings of laboratory values). All of these sources of noise complicate otherwise straightforward NLP tasks like tokenization, sentence segmentation, and ultimately information extraction itself. We propose a systematic study of ways to increase the signal-to-noise ratio in clinical narratives to improve cNLP. This work extends our preliminary research (under the POET project) and has the following aims: o Develop and implement a suite of parseability improvement tools designed for all clinical note types from multiple healthcare institutions. o Evaluate the empirical and the functional success of the parseability improvement tools. o Design and implement a HIPAA-compliant UlMA-based pipeline cNLP framework for use in a typical high-performance, multi-processor computing environment.

Public Health Relevance

We can see in the multi-billion dollar investment in electronic health records (EHRs) by the ARRA that mining clinical data electronically will continue to be essential to informatics research. Most data in the EHR resides as unstructured text, and POET2 provides a means to unlock that data through combining a new, HIPAA- complaint high-performance computing architecture with sophisticated text preprocessing.

National Institute of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Project #
Application #
Study Section
Special Emphasis Panel (ZLM1-ZH-C (01))
Program Officer
Vanbiervliet, Alan
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Utah
Schools of Medicine
Salt Lake City
United States
Zip Code
Bradford, Wayne; Hurdle, John F; LaSalle, Bernie et al. (2014) Development of a HIPAA-compliant environment for translational research data and analytics. J Am Med Inform Assoc 21:185-9
Jones, David E; Igo, Sean; Hurdle, John et al. (2014) Automatic extraction of nanoparticle properties using natural language processing: NanoSifter an application to acquire PAMAM dendrimer properties. PLoS One 9:e83932