Growing deployments of electronic health records (EHRs) systems have made massive clinical data available electronically. However, much of detailed clinical information of patients is embedded in narrative text and is not directly accessible for computerized clinical applications. Therefore, natural language processing (NLP) technologies, which can unlock information in narrative document, have received great attention in the medical domain. Current state-of-the-art NLP approaches often involve building probabilistic models. However, the wide adoption of statistical methods in clinical NLP faces two grand challenges: 1) the lack of large annotated clinical corpora;and 2) the lack of methodologies that can efficiently integrate linguistic and domain knowledge with statistical learning. High-performance statistical NLP methods rely on large scale and high quality annotations of clinical text, but it is time-consuming and costly to create large annotated clinica corpora as it often requires manual review by physicians. Moreover, the medical domain is knowledge intensive. To achieve optimal performance, probabilistic models need to leverage medical domain knowledge. Therefore, methods that can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models with minimum annotation cost would be highly desirable for clinical text processing. In this study, we propose to investigate interactive machine learning (IML) methods to address the above challenges in clinical NLP. An IML system builds a classification model in an iterative process, which can actively select informative samples for annotation based on models built on previously annotated samples, thus reducing the annotation cost for model development. More importantly, an IML system also involves human inputs to the learning process (e.g., an expert can specify important features for a classification task based on domain knowledge). Thus, IML is an ideal framework for efficiently integrating rule-based (via domain experts specifying features) and statistics-based (via different learning algorithms) approaches to clinical NLP. To achieve our goal, we propose three specific aims.
In Aim 1, we plan to investigate different aspects of IML for word sense disambiguation, including developing new active learning algorithms and conducting cognitive usability analysis for efficient feature annotation by users. To demonstrate the broad uses of IML, we further extend IML approaches to two other important clinical NLP classification tasks: named entity recognition and clinical phenoytping in Aim 2. Finally we propose to disseminate the IML methods and tools to the biomedical research community in Aim 3.

Public Health Relevance

In this project, we propose to develop interactive machine learning methods to process clinical text stored in electronic health records (EHRs) systems. Such methods can efficiently integrate domain and expert knowledge with machine learning processes to quickly build high-quality probabilistic models with minimum annotation cost, thus improving performance of text processors. This technology will allow more accurate data extraction from clinical documents, thus to facilitate clinical research that rely on EHRs data.

National Institute of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Project #
Application #
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Texas Health Science Center Houston
Schools of Allied Health Profes
United States
Zip Code
Wu, Yonghui; Lei, Jianbo; Wei, Wei-Qi et al. (2013) Analyzing differences between chinese and english clinical text: a cross-institution comparison of discharge summaries in two languages. Stud Health Technol Inform 192:662-6
Chen, Yukun; Cao, Hongxin; Mei, Qiaozhu et al. (2013) Applying active learning to supervised word sense disambiguation in MEDLINE. J Am Med Inform Assoc 20:1001-6
Chen, Yukun; Mani, Subramani; Xu, Hua (2012) Applying active learning to assertion classification of concepts in clinical text. J Biomed Inform 45:265-72
Wu, Yonghui; Rosenbloom, S Trent; Denny, Joshua C et al. (2011) Detecting abbreviations in discharge summaries using machine learning methods. AMIA Annu Symp Proc 2011:1541-9