Narratives of electronic health records (EHRs) contain useful information that is difficult to automatically extract, index, search, or interpret. Clinical natural language processing (NLP) technologies for automatic extraction, indexing, searching, and interpretation of EHRs are in development; however, due to privacy concerns related to EHRs, such technologies are usually developed by teams that have privileged access to EHRs in a specific institution. Technologies that are tailored to a specific set of data from a given institution generate inspiring results on that data; however, they can fail to generalize to similar data from other institutions and even from other departments of the same institution. Therefore, learning from these technologies and building on them becomes difficult. In order to improve NLP in EHRs, there is a need for head-to-head comparison of approaches that can address a given task on the same data set. Shared tasks provide one way of conducting systematic head-to-head comparisons.

This proposal describes a series of shared-task challenges and conferences, spread over a five-year period, that promote the development and evaluation of cutting-edge clinical NLP systems by distributing de-identified EHRs to the broad research community, under data use agreements, so that:

* the state of the art in clinical NLP technologies can be identified and advanced,
* a set of technologies that enable the use of the information contained in EHR narratives becomes available, and
* the information from EHR narratives can be made more accessible, for example, for clinical and medical research.

The scientific activities supporting the organization of the shared-task challenges are sponsored in part by Informatics for Integrating Biology and the Bedside (i2b2), grant number U54-LM008748, PI: Kohane. This proposal aims to organize a series of workshops, conference proceedings, and journal special issues that will accompany the shared-task challenges in order to disseminate the knowledge generated by the challenges.

Public Health Relevance

This proposal will address two main challenges related to the use of clinical narratives for research: the availability of clinical records for research, and the identification of the state of the art in clinical natural language processing (NLP) technologies, so that we can push the state of the art forward and future work can build on the past. Progress in clinical NLP will improve access to electronic health records for research and for clinical applications, benefiting healthcare and public health.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Conference (R13)
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
State University of New York at Albany
Schools of Arts and Sciences
United States
Karystianis, George; Dehghan, Azad; Kovacevic, Aleksandar et al. (2015) Using local lexicalized rules to identify heart disease risk factors in clinical notes. J Biomed Inform 58 Suppl:S183-8
Zheng, Kai; Vydiswaran, V G Vinod; Liu, Yang et al. (2015) Ease of adoption of clinical natural language processing software: An evaluation of five systems. J Biomed Inform 58 Suppl:S189-96
Shivade, Chaitanya; Malewadkar, Pranav; Fosler-Lussier, Eric et al. (2015) Comparison of UMLS terminologies to identify risk of heart disease using clinical notes. J Biomed Inform 58 Suppl:S103-10
Chen, Qingcai; Li, Haodi; Tang, Buzhou et al. (2015) An automatic system to identify heart disease risk factors in clinical texts over time. J Biomed Inform 58 Suppl:S158-63
Jonnagaddala, Jitendra; Liaw, Siaw-Teng; Ray, Pradeep et al. (2015) Identification and Progression of Heart Disease Risk Factors in Diabetic Patients from Longitudinal Electronic Health Records. Biomed Res Int 2015:636371
Kotfila, Christopher; Uzuner, Özlem (2015) A systematic comparison of feature space effects on disease classifier performance for phenotype identification of five diseases. J Biomed Inform 58 Suppl:S92-S102
Chen, Tao; Cullen, Richard M; Godwin, Marshall (2015) Hidden Markov model using Dirichlet process for de-identification. J Biomed Inform 58 Suppl:S60-6
Cormack, James; Nath, Chinmoy; Milward, David et al. (2015) Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge. J Biomed Inform 58 Suppl:S120-7
Stubbs, Amber; Uzuner, Özlem (2015) Annotating risk factors for heart disease in clinical narratives for diabetic patients. J Biomed Inform 58 Suppl:S78-91
Shivade, Chaitanya; Hebert, Courtney; Lopetegui, Marcelo et al. (2015) Textual inference for eligibility criteria resolution in clinical trials. J Biomed Inform 58 Suppl:S211-8

Showing the most recent 10 out of 57 publications