Patient information in the electronic health record (EHR), such as lab results, medications, and past medical history, is the basis for physician decisions about patient care. It also helps patients better understand and manage their care. Efficient access to this information is thus essential. One of the most intuitive ways of accessing data is by asking natural language questions. Although medical question answering has been studied extensively, little work has addressed question answering for EHRs. Natural language questions can be represented as logical forms, a standard structured knowledge representation technique. This project proposes to take natural language EHR questions, from both doctors and patients, and automatically convert them to logical forms. The logical forms can then be converted to structured queries such as those used by EHRs. A major obstacle to this approach is the lack of data containing questions annotated with logical forms. This project hypothesizes that a small set of questions can be manually annotated, and that paraphrases can then be produced for each annotated question. Since paraphrasing is a simpler task than logical form annotation, crowd-sourcing techniques can be used to collect thousands of question paraphrases. This question paraphrase corpus will then be used to build a semantic grammar capable of recognizing the logical structure of EHR questions. To ensure a robust, generalizable grammar, existing NLP techniques will be used to pre-process questions, simplifying their syntactic structure and abstracting their medical concepts.
To develop such a method, the candidate, Dr. Kirk Roberts, requires additional training and mentoring in natural language processing and biomedical informatics. This application for the NIH Pathway to Independence Award (K99/R00) describes a career development plan that will allow Dr. Roberts to achieve the goals of this project as well as transition to a career as an independent researcher. He will be mentored by Dr. Dina Demner-Fushman, a leading medical NLP researcher, and co-mentored by Dr. Clement McDonald, a leading EHR and medical informatics researcher.
The specific aims of the project are: (1) Build a paraphrase collection of EHR questions, where each prototype question will have many unique paraphrases. The paraphrases encompass different lexical and syntactic means of conveying the same logical form. (2) Construct a semantic grammar for EHR questions. The grammar can then be used to convert a natural language question to a logical form. (3) Implement an end-to-end question analyzer that generalizes EHR questions for improved parsing, parses the question into a logical form using the grammar, and converts the logical form into a leading structured EHR query format.
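The three stages of the question analyzer in aim (3) can be illustrated with a toy sketch. It is not the project's grammar or query format: the concept lexicon, the concept codes, the two regular-expression rules, the tuple-based logical form, and the dictionary-shaped query are all hypothetical stand-ins chosen only to show how generalizing a question lets several paraphrases map to one logical form.

```python
import re

# Hypothetical concept lexicon: surface phrase -> (placeholder type, concept code).
# The codes are illustrative placeholders, not real terminology codes.
CONCEPTS = {
    "hemoglobin a1c": ("lab", "LAB_HBA1C"),
    "potassium": ("lab", "LAB_POTASSIUM"),
}

# Hypothetical "grammar": patterns over the generalized question, each paired
# with the logical operator it signals.
RULES = [
    (re.compile(r"^what (?:is|was) (?:the|my) (?:latest|last|most recent) "
                r"<lab>(?: result| level)?$"), "latest"),
    (re.compile(r"^how many <lab>(?: results| tests| levels)? "
                r"(?:are there|were done)$"), "count"),
]


def generalize(question):
    """Normalize the question and abstract the first known concept to a placeholder."""
    text = question.lower().strip().rstrip("?").strip()
    for phrase, (ctype, code) in CONCEPTS.items():
        if phrase in text:
            return text.replace(phrase, f"<{ctype}>", 1), code
    return text, None


def parse(question):
    """Return a logical form as a nested tuple, or None if no rule applies."""
    generalized, code = generalize(question)
    if code is None:
        return None
    for pattern, operator in RULES:
        if pattern.match(generalized):
            return (operator, ("has_concept", code))
    return None


def to_query(logical_form):
    """Convert a logical form into a generic (made-up) structured query."""
    operator, (_, code) = logical_form
    query = {"resource": "lab_result", "code": code}
    if operator == "latest":
        query.update({"sort": "-date", "limit": 1})
    elif operator == "count":
        query["aggregate"] = "count"
    return query


if __name__ == "__main__":
    for q in ("What is the latest potassium level?",
              "What was my last potassium result?"):
        lf = parse(q)
        print(q, "->", lf, "->", to_query(lf))
```

In this sketch both example paraphrases reduce to the same generalized pattern and therefore to the same logical form, which is the property a paraphrase corpus would be used to learn at scale rather than hand-code.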

Public Health Relevance

The proposed work aims to significantly improve the ability of both doctors and patients to find information within electronic health records (EHR). By providing an interface to EHRs where users can specify their information needs in the form of a natural language question, the proposed work provides a more intuitive means of finding patient data than is currently available.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Transition Award (R00)
Study Section: Special Emphasis Panel (NSS)
Program Officer: Vanbiervliet, Alan
University of Texas Health Science Center Houston
Sch Allied Health Professions
United States
Demner-Fushman, Dina; Shooshan, Sonya E; Rodriguez, Laritza et al. (2018) A dataset of 200 structured product labels annotated for adverse drug reactions. Sci Data 5:180001
Zhang, Yaoyun; Li, Hee-Jin; Wang, Jingqi et al. (2018) Adapting Word Embeddings from Multiple Domains to Symptom Recognition from Psychiatric Notes. AMIA Jt Summits Transl Sci Proc 2017:281-289
Zhang, Yaoyun; Zhang, Olivia; Wu, Yonghui et al. (2017) Psychiatric symptom recognition without labeled data using distributional representations of phrases and on-line knowledge. J Biomed Inform 75S:S129-S137
Lee, Hee-Jin; Zhang, Yaoyun; Roberts, Kirk et al. (2017) Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation. AMIA Annu Symp Proc 2017:1070-1079
Lee, Hee-Jin; Wu, Yonghui; Zhang, Yaoyun et al. (2017) A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform 75S:S19-S27
Mrabet, Yassine; Kilicoglu, Halil; Roberts, Kirk et al. (2016) Combining Open-domain and Biomedical Knowledge for Topic Recognition in Consumer Health Questions. AMIA Annu Symp Proc 2016:914-923
Roberts, Kirk; Demner-Fushman, Dina (2016) Annotating Logical Forms for EHR Questions. LREC Int Conf Lang Resour Eval 2016:3772-3778
Roberts, Kirk; Rodriguez, Laritza; Shooshan, Sonya E et al. (2016) Resource Classification for Medical Questions. AMIA Annu Symp Proc 2016:1040-1049
Roberts, Kirk; Demner-Fushman, Dina (2016) Interactive use of online health resources: a comparison of consumer and professional questions. J Am Med Inform Assoc 23:802-11