Automatic extraction of useful information from clinical texts enables new clinical research tasks and new technologies at the point of care. The natural language processing (NLP) systems that perform this extraction rely on supervised machine learning. The learning process uses manually labeled datasets that are limited in size and scope, and as a result, applying NLP systems to unseen datasets often results in severely degraded performance. Obtaining larger and broader labeled datasets is impractical, because manual labeling is expensive and sharing text data across institutions is difficult. Therefore, this project develops unsupervised domain adaptation algorithms to adapt NLP systems to new data. Domain adaptation is the process of adapting a machine learning system to new data sources; the proposed methods are unsupervised in that they require no manual labels for the new data. This project has three aims.
The first aim uses multiple existing datasets for the same task to study differences between domains, and uses this information to develop new domain adaptation algorithms. Evaluation uses standard machine learning metrics, and performance analysis is bounded from below by strong baselines and from above by realistic upper bounds, both grounded in theoretical research on machine learning generalization.
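The summary does not name the specific algorithms, so as a minimal illustrative sketch, consider correlation alignment (CORAL), a standard unsupervised domain adaptation baseline that matches the second-order statistics of labeled source features to those of unlabeled target features before a supervised model is trained. This is a generic technique shown only to make the idea concrete, not the project's actual method; the function names are ours.

    # Minimal CORAL sketch: align source feature covariance to the target's.
    # Illustrative baseline only, not the method proposed in this project.
    import numpy as np

    def _sym_power(C, p):
        # Power of a symmetric positive-definite matrix via eigendecomposition.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(vals ** p) @ vecs.T

    def coral_align(X_src, X_tgt, eps=1e-3):
        # Regularized covariances of labeled-source and unlabeled-target features.
        d = X_src.shape[1]
        C_src = np.cov(X_src, rowvar=False) + eps * np.eye(d)
        C_tgt = np.cov(X_tgt, rowvar=False) + eps * np.eye(d)
        # Whiten the source features, then re-color them with target statistics,
        # so a classifier trained on the result sees target-like inputs.
        return X_src @ _sym_power(C_src, -0.5) @ _sym_power(C_tgt, 0.5)

A classifier trained on coral_align(X_src, X_tgt) with the original source labels can then be applied directly to target-domain features, with no target labels required.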
The second aim develops open source software tools to simplify the process of incorporating domain adaptation into clinical text processing workflows. This software will have input interfaces that connect to the methods developed in Aim 1 and output interfaces that connect with Apache cTAKES, a widely used open-source NLP tool.
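The summary specifies only that the tools expose input interfaces toward the Aim 1 methods and output interfaces toward cTAKES. As a hedged sketch of what such a plug-in contract might look like, the names below (DomainAdapter, fit_unlabeled, annotate) are hypothetical and are not the project's published API.

    # Hypothetical plug-in contract; all names are illustrative only.
    from abc import ABC, abstractmethod
    from typing import Dict, Iterable, List

    class DomainAdapter(ABC):
        """Wraps an adaptation method behind a stable interface."""

        @abstractmethod
        def fit_unlabeled(self, target_texts: Iterable[str]) -> "DomainAdapter":
            # Input interface: learn from unlabeled notes in the new domain.
            ...

        @abstractmethod
        def annotate(self, text: str) -> List[Dict]:
            # Output interface: return span-level annotations (e.g., dicts
            # with begin/end offsets and a label) that a downstream consumer
            # such as a cTAKES pipeline component could ingest.
            ...

Keeping both sides behind abstract methods would let new adaptation algorithms be swapped in without changing the cTAKES-facing code.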
The third aim tests these methods in an end-to-end use case: adverse drug event (ADE) extraction on a dataset of pediatric pulmonary hypertension notes. ADE extraction relies on multiple NLP systems, so this use case shows how broad improvements to NLP methods can improve downstream applications.
This aim also creates new manual labels for the dataset, enabling an end-to-end evaluation that directly measures how improvements to the NLP systems lead to improvements in ADE extraction.
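The summary does not detail the evaluation; a minimal sketch of one plausible end-to-end score, assuming system and gold ADEs are represented as (note_id, drug, event) tuples, is:

    # Hypothetical end-to-end scoring against manual labels: compare
    # predicted and gold ADE mentions as sets of tuples.
    def ade_prf(pred: set, gold: set):
        tp = len(pred & gold)  # exact tuple matches
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Illustrative data only; not from the project's dataset.
    gold = {("note1", "sildenafil", "headache")}
    pred = {("note1", "sildenafil", "headache"),
            ("note2", "bosentan", "edema")}
    print(ade_prf(pred, gold))  # -> (0.5, 1.0, 0.666...)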

Public Health Relevance

Software systems that use machine learning to understand clinical text often suffer severe performance loss when they are applied to new data that looks different from the data they originally learned from. In this project, we develop and implement methods that allow these systems to automatically adapt to the characteristics of a new data source. We evaluate these methods on the clinical research task of adverse drug event detection, which relies on many different variables found in the text of electronic health records.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM012918-03
Application #: 9986899
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Sim, Hua-Chuan
Project Start: 2018-09-01
Project End: 2021-07-31
Budget Start: 2020-08-01
Budget End: 2021-07-31
Support Year: 3
Fiscal Year: 2020
Total Cost:
Indirect Cost:
Name: Boston Children's Hospital
Department:
Type:
DUNS #: 076593722
City: Boston
State: MA
Country: United States
Zip Code: 02115