and Abstract Narratives of electronic health records (EHRs) contain useful information that is difficult to automatically extract, index, search, or interpret. Natural language processing (NLP) technologies can extract this information and convert it in to a structured format that is more readily accessible by computerized systems. However, the development of NLP systems is contingent on access to relevant data and EHRs are notoriously difficult to obtain because of privacy reasons. Despite the recent efforts to de-identify and release narrative EHRs for research, these data are still very rare. As a result, clinical NLP, as a field has lagged behind. To address this problem, since 2006, we organized thirteen shared tasks, accompanied with workshops and journal publications. Twelve of these shared tasks have focused on the development of clinical NLP systems and the remaining one on the usability of these systems. We have covered both depth and breadth in terms of shared tasks, preparing tasks that study cutting-edge NLP problems on a variety of EHR data from multiple institutions. Our shared tasks are the longest running series of clinical NLP shared tasks, with ever growing EHR data sets, tasks, and participation. Our most popular three data sets have been cited 495 (2010 data), 284 (2006 de-id data), and 274 (2009 data) times, respectively, representing hundreds of articles that have come out of these three data sets alone. Our goal in this proposal is to continue the efforts we started in 2006 under i2b2 shared task challenges (i2b2, NIH NLM U54LM008748, PI: Kohane and R13 LM011411, PI: Uzuner) to de-identify EHRs, annotate them with gold- standard annotations for clinical NLP tasks, and release them to the research community for the development and head-to-head comparison of clinical NLP systems, for the advancement of the state of the art. Continuing our efforts under National NLP Clinical Challenges (n2c2) based at the Health Data Science program of the newly established Department of Biomedical Informatics at Harvard Medical School, we aim to form partnerships with the community to grow the shared task efforts in several ways: (1) grow the available de-identified EHR data sets through partnerships that can contribute to the volume and variety of the data, and (2) grow the available gold-standard annotations in terms of depth and breadth of NLP tasks. Given these aims and partnerships, we plan to hold a series of shared tasks. We will complement these shared tasks with workshops that meet in conjunction with the Fall Symposium of the American Medical Informatics Association and with journal special issues so that advancement of the state of the art can be sped up and future generations can build on the past.

Public Health Relevance

We propose to organize a series of shared tasks, workshops, and journal publications for fostering the continuous development of clinical Natural Language Processing (NLP) technologies that can extract information from narratives of Electronic Health Records (EHRs). Our aim is to grow the annotated gold standard EHR data sets that are available to the research community through partnerships and to bring together clinical NLP researchers with informatics researchers for building collaborations. We will engage the community in shared tasks and disseminate the knowledge generated by these shared tasks through workshops and journal special issues for the advancement of the state of the art.

Agency
National Institute of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Conference (R13)
Project #
5R13LM013127-02
Application #
9930152
Study Section
Special Emphasis Panel (ZLM1)
Program Officer
Sim, Hua-Chuan
Project Start
2019-05-15
Project End
2024-04-30
Budget Start
2020-05-01
Budget End
2021-04-30
Support Year
2
Fiscal Year
2020
Total Cost
Indirect Cost
Name
George Mason University
Department
Biostatistics & Other Math Sci
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
077817450
City
Fairfax
State
VA
Country
United States
Zip Code
22030