Growing deployment of electronic health record (EHR) systems has made massive amounts of clinical data available electronically. However, much of the detailed clinical information about patients is embedded in narrative text and is not directly accessible to computerized clinical applications. Therefore, natural language processing (NLP) technologies, which can unlock information in narrative documents, have received great attention in the medical domain. Current state-of-the-art NLP approaches often involve building probabilistic models. However, the wide adoption of statistical methods in clinical NLP faces two grand challenges: 1) the lack of large annotated clinical corpora; and 2) the lack of methodologies that can efficiently integrate linguistic and domain knowledge with statistical learning. High-performance statistical NLP methods rely on large-scale, high-quality annotations of clinical text, but creating large annotated clinical corpora is time-consuming and costly because it often requires manual review by physicians. Moreover, the medical domain is knowledge intensive. To achieve optimal performance, probabilistic models need to leverage medical domain knowledge. Therefore, methods that can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models at minimum annotation cost would be highly desirable for clinical text processing. In this study, we propose to investigate interactive machine learning (IML) methods to address the above challenges in clinical NLP. An IML system builds a classification model iteratively: it actively selects informative samples for annotation based on models built on previously annotated samples, thus reducing the annotation cost of model development. More importantly, an IML system also incorporates human input into the learning process (e.g., an expert can specify important features for a classification task based on domain knowledge).
Thus, IML is an ideal framework for efficiently integrating rule-based (via domain experts specifying features) and statistics-based (via different learning algorithms) approaches to clinical NLP. To achieve our goal, we propose three specific aims.
In Aim 1, we plan to investigate different aspects of IML for word sense disambiguation, including developing new active learning algorithms and conducting cognitive usability analysis for efficient feature annotation by users. In Aim 2, to demonstrate the broad applicability of IML, we will extend IML approaches to two other important clinical NLP classification tasks: named entity recognition and clinical phenotyping. Finally, in Aim 3, we propose to disseminate the IML methods and tools to the biomedical research community.
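
The iterative sample-selection loop described in the abstract can be illustrated with a minimal sketch. This is not the project's actual method: it assumes least-confidence uncertainty sampling as the selection strategy and a toy Naive Bayes classifier over whitespace tokens, and every name in it (NaiveBayes, least_confident, interactive_loop, oracle) is hypothetical. The expert-specified feature input that the abstract also mentions is left out for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for real clinical features.
    return text.lower().split()


class NaiveBayes:
    """Toy multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.labels = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {y: Counter() for y in self.labels}
        for text, y in zip(texts, labels):
            self.counts[y].update(tokenize(text))
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict_proba(self, text):
        total = sum(self.prior.values())
        log_scores = {}
        for y in self.labels:
            denom = sum(self.counts[y].values()) + len(self.vocab)
            score = math.log(self.prior[y] / total)
            for w in tokenize(text):
                if w in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.counts[y][w] + 1) / denom)
            log_scores[y] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        exp = {y: math.exp(s - top) for y, s in log_scores.items()}
        z = sum(exp.values())
        return {y: v / z for y, v in exp.items()}


def least_confident(model, pool):
    """Index of the pool item whose most likely label has the lowest probability."""
    return min(range(len(pool)),
               key=lambda i: max(model.predict_proba(pool[i]).values()))


def interactive_loop(seed_texts, seed_labels, pool, oracle, rounds):
    """Repeatedly pick the most uncertain sample, ask the annotator, retrain."""
    texts, labels, pool = list(seed_texts), list(seed_labels), list(pool)
    model = NaiveBayes().fit(texts, labels)
    for _ in range(min(rounds, len(pool))):
        sample = pool.pop(least_confident(model, pool))
        texts.append(sample)
        labels.append(oracle(sample))  # oracle stands in for the human annotator
        model = NaiveBayes().fit(texts, labels)
    return model, texts, labels
```

In use, oracle would be replaced by a prompt to a human annotator, and the loop would typically stop once model performance on a held-out set plateaus rather than after a fixed number of rounds.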

Public Health Relevance

In this project, we propose to develop interactive machine learning methods to process clinical text stored in electronic health record (EHR) systems. Such methods can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models at minimum annotation cost, thus improving the performance of text processors. This technology will allow more accurate data extraction from clinical documents, thereby facilitating clinical research that relies on EHR data.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
University of Texas Health Science Center Houston
Schools of Allied Health Profes
United States
Lee, Hee-Jin; Zhang, Yaoyun; Jiang, Min et al. (2018) Identifying direct temporal relations between time and events from clinical notes. BMC Med Inform Decis Mak 18:49
Brusco, Lauren L; Wathoo, Chetna; Mills Shaw, Kenna R et al. (2018) Physician interpretation of genomic test results and treatment selection. Cancer 124:966-972
Zhang, Yaoyun; Zhang, Olivia; Wu, Yonghui et al. (2017) Psychiatric symptom recognition without labeled data using distributional representations of phrases and on-line knowledge. J Biomed Inform 75S:S129-S137
Wu, Yonghui; Jiang, Min; Xu, Jun et al. (2017) Clinical Named Entity Recognition Using Deep Learning Models. AMIA Annu Symp Proc 2017:1812-1819
Ji, Zongcheng; Zhang, Yaoyun; Xu, Jun et al. (2017) Comparing Cancer Information Needs for Consumers in the US and China. Stud Health Technol Inform 245:126-130
Lee, Hee-Jin; Zhang, Yaoyun; Roberts, Kirk et al. (2017) Leveraging existing corpora for de-identification of psychiatric notes using domain adaptation. AMIA Annu Symp Proc 2017:1070-1079
Lee, Hee-Jin; Wu, Yonghui; Zhang, Yaoyun et al. (2017) A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform 75S:S19-S27
Wang, Yue; Zheng, Kai; Xu, Hua et al. (2016) Clinical Word Sense Disambiguation with Interactive Search and Classification. AMIA Annu Symp Proc 2016:2062-2071
Zhang, Yaoyun; Xu, Jun; Chen, Hui et al. (2016) Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning. Database (Oxford) 2016:
Duan, Rui; Cao, Ming; Wu, Yonghui et al. (2016) An Empirical Study for Impacts of Measurement Errors on EHR based Association Studies. AMIA Annu Symp Proc 2016:1764-1773