As electronic health records (EHRs) continue their expansion into clinical settings, there has been a corresponding increase in interest in mining the data they contain, both for research and for clinical decision support. Informaticists are increasingly studying ways to mine EHR textual content. This is an important trend, because clinical text holds a wealth of information that is not represented anywhere else in the EHR. A low-level text-as-data issue, however, presents a significant obstacle to the widespread use of available medical NLP systems: hand-typed clinical narratives in EHRs are usually ungrammatical; short or telegraphic in style; full of abbreviations, acronyms, and misspellings; formatted in a templated or pseudo-tabular form; and interspersed with embedded non-text, such as a list of laboratory values cut and pasted from elsewhere in the EHR. As we show in the Preliminary Studies Section, this makes high-level processing by popular tools like MedLEE and MetaMap effectively useless for all but a few "clean" document types like discharge summaries or consult reports (e.g., pathology or radiology reports). This in turn explains why so little has been published about what is certainly the preponderance of clinical texts: those that are not as well-behaved lexically and syntactically as a discharge summary. In this application we distinguish clinical narratives (e.g., a progress note) from biomedical narratives (e.g., a PubMed abstract). We are interested in texts that arise in the clinical or research setting and that clinicians and researchers compose directly into a computer system. We propose to build and publish a tool called POET (Parsable Output Extracted from Text). POET will be designed to accept unstructured textual documents and return structured, linguistic equivalents that are, to the extent possible, parsable by higher-level NLP engines.
POET will have an architecture that is modular, extensible, and based on open-source platforms and sources (e.g., Java, Perl, UMLS, NegEx, the Stanford Parser, HL7 Clinical Document Architecture, caGRID, etc.). To implement POET, we will collect, program, and evaluate published as well as novel algorithms for: acronym/abbreviation resolution; spelling correction; template and pseudo-table rewriting; and removal of embedded non-text. To test POET we will use a large corpus of cross-discipline (e.g., medical, nursing, pharmacy, etc.) clinical note types, as well as two kinds of clinical research text: MedWatch reports and IRB adverse event reports. The development of POET will combine the best practices found in the literature with new research efforts undertaken as part of the project. To validate the fidelity of POET processing, we plan a formal analysis of information loss and information gain, comparing texts before and after processing. To ensure broad access to the tools, POET will be released under an open-source license. Finally, we plan to assess the feasibility of offering POET as a Web service for remote processing.
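
The modular design described above can be illustrated with a minimal sketch. It is not the POET implementation: the abstract names Java and Perl as platforms, while Python is used here only for brevity. The abbreviation table, the lexicon, the laboratory-line pattern, the stage order, and every function name are hypothetical stand-ins chosen for illustration. A real system would draw on resources such as the UMLS. The template and pseudo-table rewriting step is omitted.

```python
import re
from difflib import get_close_matches
from typing import Callable, List

# Hypothetical toy resources, assumed for illustration only.
ABBREVIATIONS = {"pt": "patient", "hx": "history", "sob": "shortness of breath"}
LEXICON = ["patient", "history", "denies", "chest", "pain",
           "shortness", "of", "breath"]

# Crude heuristic: a line made up only of "label number" pairs,
# such as "Na 140 K 4.1", is treated as pasted laboratory values.
LAB_LINE = re.compile(r"^\s*(?:[A-Za-z]{1,5}\s+\d+(?:\.\d+)?\s*)+$")

Stage = Callable[[str], str]


def remove_non_text(text: str) -> str:
    """Drop lines that look like embedded laboratory values."""
    kept = [line for line in text.splitlines() if not LAB_LINE.match(line)]
    return "\n".join(kept)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expansion."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        return ABBREVIATIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", repl, text)


def correct_spelling(text: str) -> str:
    """Map out-of-lexicon words to the closest lexicon entry, if any."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in LEXICON:
            return word
        close = get_close_matches(word.lower(), LEXICON, n=1, cutoff=0.7)
        return close[0] if close else word
    return re.sub(r"[A-Za-z]+", repl, text)


def run_pipeline(text: str, stages: List[Stage]) -> str:
    """Apply each stage in order; stages can be added, removed, or swapped."""
    for stage in stages:
        text = stage(text)
    return text


# Assumed stage order: abbreviations are expanded before spelling
# correction so that short forms are not mistaken for misspellings.
PIPELINE: List[Stage] = [remove_non_text, expand_abbreviations, correct_spelling]

if __name__ == "__main__":
    note = "pt denies chest pian\nNa 140 K 4.1\nhx of sob"
    print(run_pipeline(note, PIPELINE))
```

Because each stage is a plain text-to-text function, an alternative algorithm for any one step can be substituted and evaluated without changing the others, which is the property the modular architecture is meant to provide.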

Public Health Relevance

This project attempts the construction of POET, a low-level preprocessing system for full text that can be used to open up large portions of the electronic health record (EHR) to high-level NLP systems. The potential public health implications are: 1) POET will allow the use of well-proven clinical NLP systems (currently limited to only a few document types found in the EHR) to expand to the entire clinical text record; with the entirety of the clinical record accessible to NLP, serious and realistic attempts at real-time clinical text surveillance can be mounted to improve patient safety and quality of care; 2) POET will be made available through open-source distribution and other means to encourage the practical deployment of innovative decision support systems using large healthcare network EHRs across the country; 3) POET meets an important translational public health informatics need by solving persistent low-level barriers to effective data mining of clinical narratives in the EHR. The wider public health implications include promoting effective computerized decision support and data mining to improve both personal and public health outcomes.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Exploratory/Developmental Grants (R21)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
University of Utah
Schools of Medicine
Salt Lake City
United States
Bradford, Wayne; Hurdle, John F; LaSalle, Bernie et al. (2014) Development of a HIPAA-compliant environment for translational research data and analytics. J Am Med Inform Assoc 21:185-9
Patterson, Olga; Hurdle, John F (2011) Document clustering of clinical narratives: a systematic study of clinical sublanguages. AMIA Annu Symp Proc 2011:1099-107
Workman, T Elizabeth; Hurdle, John F (2011) Dynamic summarization of bibliographic-based data. BMC Med Inform Decis Mak 11:6
Kim, Youngjun; Hurdle, John; Meystre, Stéphane M (2011) Using UMLS lexical resources to disambiguate abbreviations in clinical text. AMIA Annu Symp Proc 2011:715-22
Meystre, Stéphane M; Thibault, Julien; Shen, Shuying et al. (2010) Automatically detecting medications and the reason for their prescription in clinical narrative text documents. Stud Health Technol Inform 160:944-8
Meystre, Stéphane M; Thibault, Julien; Shen, Shuying et al. (2010) Textractor: a hybrid system for medications and reason for their prescription extraction from clinical text documents. J Am Med Inform Assoc 17:559-62
Patterson, Olga; Igo, Sean; Hurdle, John F (2010) Automatic acquisition of sublanguage semantic schema: towards the word sense disambiguation of clinical narratives. AMIA Annu Symp Proc 2010:612-6