The growing deployment of electronic health record (EHR) systems has made massive amounts of clinical data available electronically. However, much of the detailed clinical information about patients is embedded in narrative text and is not directly accessible to computerized clinical applications. Therefore, natural language processing (NLP) technologies, which can unlock information in narrative documents, have received great attention in the medical domain. Current state-of-the-art NLP approaches often involve building probabilistic models. However, the wide adoption of statistical methods in clinical NLP faces two grand challenges: 1) the lack of large annotated clinical corpora; and 2) the lack of methodologies that can efficiently integrate linguistic and domain knowledge with statistical learning. High-performance statistical NLP methods rely on large-scale, high-quality annotations of clinical text, but creating large annotated clinical corpora is time-consuming and costly because it often requires manual review by physicians. Moreover, the medical domain is knowledge intensive. To achieve optimal performance, probabilistic models need to leverage medical domain knowledge. Therefore, methods that can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models with minimum annotation cost would be highly desirable for clinical text processing. In this study, we propose to investigate interactive machine learning (IML) methods to address the above challenges in clinical NLP. An IML system builds a classification model in an iterative process: it actively selects informative samples for annotation based on models built on previously annotated samples, thus reducing the annotation cost of model development. More importantly, an IML system also incorporates human input into the learning process (e.g., an expert can specify important features for a classification task based on domain knowledge). 
Thus, IML is an ideal framework for efficiently integrating rule-based (via domain experts specifying features) and statistics-based (via different learning algorithms) approaches to clinical NLP. To achieve our goal, we propose three specific aims.
In Aim 1, we plan to investigate different aspects of IML for word sense disambiguation, including developing new active learning algorithms and conducting cognitive usability analysis for efficient feature annotation by users. To demonstrate the broad applicability of IML, in Aim 2 we further extend IML approaches to two other important clinical NLP classification tasks: named entity recognition and clinical phenotyping. Finally, in Aim 3 we propose to disseminate the IML methods and tools to the biomedical research community.
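
The iterative process described in the abstract (train on the annotated samples, pick the most informative unlabeled sample, ask a human to label it, retrain) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the project's actual method: the tiny Naive Bayes model, the least-confidence selection rule, the `boost` weight used as a crude stand-in for expert-specified features, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled, expert_features=(), boost=2.0):
    """Fit per-sense word counts for a tiny Naive Bayes model.

    labeled: list of (context_words, sense) pairs.
    Words listed in expert_features are counted `boost` times; this is a
    hypothetical stand-in for an expert marking features as important.
    """
    counts = defaultdict(Counter)
    priors = Counter()
    for words, sense in labeled:
        priors[sense] += 1
        for w in words:
            counts[sense][w] += boost if w in expert_features else 1.0
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab


def predict_proba(model, words):
    """Return {sense: probability} using add-one smoothing."""
    counts, priors, vocab = model
    total = sum(priors.values())
    log_scores = {}
    for sense in priors:
        denom = sum(counts[sense].values()) + len(vocab) + 1
        score = math.log(priors[sense] / total)
        for w in words:
            score += math.log((counts[sense][w] + 1.0) / denom)
        log_scores[sense] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    exp = {s: math.exp(v - top) for s, v in log_scores.items()}
    z = sum(exp.values())
    return {s: v / z for s, v in exp.items()}


def most_uncertain(model, pool):
    """Least-confidence sampling: index of the pool item whose most
    probable sense has the lowest probability."""
    return min(
        range(len(pool)),
        key=lambda i: max(predict_proba(model, pool[i]).values()),
    )


def interactive_loop(seed, pool, oracle, expert_features=(), rounds=3):
    """Repeatedly retrain, query the annotator (oracle) on the most
    uncertain sample, and add the answer to the labeled set."""
    labeled, pool = list(seed), list(pool)
    for _ in range(min(rounds, len(pool))):
        model = train(labeled, expert_features)
        i = most_uncertain(model, pool)
        words = pool.pop(i)
        labeled.append((words, oracle(words)))
    return train(labeled, expert_features), labeled
```

In this sketch the `oracle` callable plays the role of the human annotator; in a real interactive system it would be a user interface rather than a function, and the selection rule and model would be chosen per task.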

Public Health Relevance

In this project, we propose to develop interactive machine learning methods to process clinical text stored in electronic health record (EHR) systems. Such methods can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models with minimum annotation cost, thus improving the performance of text processors. This technology will allow more accurate data extraction from clinical documents, thereby facilitating clinical research that relies on EHR data.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
University of Texas Health Science Center Houston
Schools of Allied Health Profes
United States
Zhang, Yaoyun; Xu, Jun; Chen, Hui et al. (2016) Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning. Database (Oxford) 2016:
Xu, Jun; Wu, Yonghui; Zhang, Yaoyun et al. (2016) CD-REST: a system for extracting chemical-induced disease relation in literature. Database (Oxford) 2016:
Zhang, Yaoyun; Soysal, Ergin; Moon, Sungrim et al. (2015) Integrating Multiple On-line Knowledge Bases for Disease-Lab Test Relation Extraction. AMIA Jt Summits Transl Sci Proc 2015:204-8
Wu, Yonghui; Xu, Jun; Jiang, Min et al. (2015) A Study of Neural Word Embeddings for Named Entity Recognition in Clinical Text. AMIA Annu Symp Proc 2015:1326-33
Wu, Yonghui; Jiang, Min; Lei, Jianbo et al. (2015) Named Entity Recognition in Chinese Clinical Text Using Deep Neural Network. Stud Health Technol Inform 216:624-8
Xu, Jun; Zhang, Yaoyun; Wu, Yonghui et al. (2015) Citation Sentiment Analysis in Clinical Trial Papers. AMIA Annu Symp Proc 2015:1334-41
Jiang, Min; Huang, Yang; Fan, Jung-wei et al. (2015) Parsing clinical text: how good are the state-of-the-art parsers? BMC Med Inform Decis Mak 15 Suppl 1:S2
Chen, Yukun; Lasko, Thomas A; Mei, Qiaozhu et al. (2015) A study of active learning methods for named entity recognition in clinical text. J Biomed Inform 58:11-8
Wu, Y; Denny, J C; Rosenbloom, S T et al. (2015) A Preliminary Study of Clinical Abbreviation Disambiguation in Real Time. Appl Clin Inform 6:364-74
Wu, Yonghui; Lei, Jianbo; Wei, Wei-Qi et al. (2013) Analyzing differences between Chinese and English clinical text: a cross-institution comparison of discharge summaries in two languages. Stud Health Technol Inform 192:662-6

Showing the most recent 10 out of 20 publications