Automatic extraction of useful information from clinical texts enables new clinical research tasks and new technologies at the point of care. The natural language processing (NLP) systems that perform this extraction rely on supervised machine learning. The learning process uses manually labeled datasets that are limited in size and scope, and as a result, applying NLP systems to unseen datasets often results in severely degraded performance. Obtaining larger and broader labeled datasets is impractical because of the expense of manual labeling and the difficulty of sharing text data across institutions. Therefore, this project develops unsupervised domain adaptation algorithms to adapt NLP systems to new data. Domain adaptation is the process of adapting a machine learning system to new data sources; the proposed methods are unsupervised in that they do not require manual labels for the new data. This project has three aims.
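To make the idea concrete, the following is a minimal, generic sketch of one common unsupervised adaptation strategy, self-training with pseudo-labels, assuming scikit-learn and NumPy. The function name, confidence threshold, and classifier are illustrative assumptions only, not the algorithms this project will develop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(source_X, source_y, target_X, confidence=0.9):
    """Adapt a classifier to an unlabeled target domain via pseudo-labeling (illustrative sketch)."""
    # Fit on labeled source-domain data only.
    model = LogisticRegression(max_iter=1000)
    model.fit(source_X, source_y)

    # Keep target examples that the source model labels with high confidence.
    probs = model.predict_proba(target_X)
    confident = probs.max(axis=1) >= confidence
    pseudo_y = model.classes_[probs.argmax(axis=1)[confident]]

    # Retrain on source labels plus confident target pseudo-labels.
    X = np.vstack([source_X, target_X[confident]])
    y = np.concatenate([np.asarray(source_y), pseudo_y])
    return model.fit(X, y)
```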
Aim 1 makes use of multiple existing datasets for the same task to study differences between domains, and uses this information to develop new domain adaptation algorithms. Evaluation uses standard machine learning metrics, and performance is tightly bounded from below by strong baselines and from above by realistic upper bounds, both grounded in theoretical research on machine learning generalization.
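As a hedged illustration of such reference points, the sketch below follows the common convention that a source-only model bounds performance from below while a model trained on labeled target data approximates an in-domain upper bound. The function name and the macro-F1 metric are assumptions for illustration, not the project's evaluation protocol.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def adaptation_bounds(source_X, source_y,
                      target_train_X, target_train_y,
                      target_test_X, target_test_y):
    """Compute lower- and upper-bound reference scores on a target test set (illustrative sketch)."""
    # Lower bound: train on source data only, evaluate on the target test set.
    src_model = LogisticRegression(max_iter=1000).fit(source_X, source_y)
    lower = f1_score(target_test_y, src_model.predict(target_test_X), average="macro")

    # Upper bound: train directly on labeled target data (unavailable to the
    # unsupervised methods; used only as a reference point).
    tgt_model = LogisticRegression(max_iter=1000).fit(target_train_X, target_train_y)
    upper = f1_score(target_test_y, tgt_model.predict(target_test_X), average="macro")
    return lower, upper
```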
Aim 2 develops open-source software tools to simplify the process of incorporating domain adaptation into clinical text processing workflows. This software will have input interfaces that connect to the methods developed in Aim 1 and output interfaces that connect to Apache cTAKES, a widely used open-source NLP tool.
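One hypothetical shape such interfaces might take is sketched below: an input side that wraps an adaptation method such as those from Aim 1, and an output side that serializes adapted predictions for a downstream pipeline such as cTAKES. All class and method names here are illustrative assumptions, not the toolkit's actual design or any cTAKES API.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List

class DomainAdapter(ABC):
    """Input interface: wraps an adaptation algorithm (e.g., one developed in Aim 1)."""

    @abstractmethod
    def adapt(self, model, target_documents: Iterable[str]):
        """Return a model adapted to the unlabeled target documents."""

class AnnotationWriter(ABC):
    """Output interface: serializes predictions for a downstream NLP pipeline."""

    @abstractmethod
    def write(self, documents: List[str], predictions: List[dict], out_path: str) -> None:
        """Write per-document predictions in a format the downstream pipeline can ingest."""
```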
Aim 3 tests these methods in an end-to-end use case: adverse drug event (ADE) extraction from a dataset of pediatric pulmonary hypertension notes. ADE extraction relies on multiple NLP systems, so this use case can show how broad improvements to NLP methods translate into improvements in downstream applications.
This aim also creates new manual labels for the dataset, enabling an end-to-end evaluation that directly measures how improvements to the NLP systems improve ADE extraction.
Software systems that use machine learning to understand clinical text often suffer severe performance loss when they are applied to new data that looks different from the data they originally learned from. In this project, we develop and implement methods that allow these systems to automatically adapt to the characteristics of a new data source. We evaluate these methods on the clinical research task of adverse drug event detection, which relies on many different variables found in the text of electronic health records.