In this project we develop new methods, based on recurrent neural networks, for extracting important information from electronic health records (EHRs). These methods represent the hierarchical and sequential nature of human language, leverage large-scale datasets to make learning sophisticated representations possible, and make use of novel sources of supervision that are available at this scale. The model architecture we propose is a hierarchical recurrent neural network (RNN). This architecture explicitly represents temporality at multiple time scales, with stacked RNN layers representing words, sentences, paragraphs, and documents. At the word level, the model is trained to predict important pieces of clinical information, such as negation and temporality, using existing labeled datasets. Training for clinical information extraction at the lowest level ensures that the higher-level models have a foundation of medically relevant inputs. We are still left with the challenge of training the higher-level networks, because these models require massive amounts of labeled training data to learn. We solve this problem by taking advantage of the temporal aspect of information in an EHR, training each higher-level recurrent layer with supervision from the future. For example, the document RNN is trained to predict the billing codes and NLP concept codes found in the subsequent document. This source of supervision is scalable, and our preliminary data show that it is effective for learning generalizable patient representations. The patient representations that our model learns are shareable across multiple tasks, potentially streamlining EHR-based research by eliminating what was previously a manual step: designing text-based variables to represent patients. We demonstrate a new workflow for text-based EHR research, showing how the same representations can be used for two completely distinct phenotyping tasks.
These phenotyping studies make use of high-quality datasets of patients with pulmonary hypertension (PH) and autism spectrum disorder (ASD) at Boston Children's Hospital. PH is relatively rare, so finding every patient with a phenotyping algorithm is important for clinical research. ASD has several sub-phenotypes, and finding large numbers of patients from each sub-phenotype can help to better understand the mechanisms of ASD. Along with demonstrating the applicability of our representations to these specific clinical research use cases, we incorporate our patient representations into the i2b2 clinical research software, making them available to all clinical investigators using this platform at Boston Children's Hospital.
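The data flow described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's implementation: the cell is an untrained plain tanh RNN, one hidden size (`HIDDEN`) is shared by every level, and the names `make_cell`, `run_cell`, `encode`, `next_document_targets`, and the placeholder codes `CODE_A`, `CODE_B`, `CODE_C` are all hypothetical. The sketch shows one RNN per level of the hierarchy (words, sentences, paragraphs, documents) and how each document-level state could be paired with the codes of the subsequent document as its training target.

```python
import math
import random

HIDDEN = 4  # hypothetical vector size, shared by every level for simplicity


def make_cell(rng):
    """Random weights for a plain tanh RNN cell (untrained, illustration only)."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
    return {"w_in": mat(), "w_rec": mat()}


def run_cell(cell, inputs):
    """Run the cell over a sequence of vectors; return the hidden state after each step."""
    h = [0.0] * HIDDEN
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * v for w, v in zip(cell["w_in"][j], x))
                + sum(w * v for w, v in zip(cell["w_rec"][j], h))
            )
            for j in range(HIDDEN)
        ]
        states.append(h)
    return states


def encode(node, cells, level):
    """Encode a non-empty nested sequence bottom-up, using one cell per level."""
    if level == 0:
        return run_cell(cells[0], node)[-1]  # word vectors -> sentence vector
    children = [encode(child, cells, level - 1) for child in node]
    return run_cell(cells[level], children)[-1]


def next_document_targets(codes_per_doc, vocab):
    """Multi-hot targets: document i is supervised by the codes of document i + 1."""
    index = {code: i for i, code in enumerate(vocab)}
    targets = []
    for future_codes in codes_per_doc[1:]:
        row = [0] * len(vocab)
        for code in future_codes:
            row[index[code]] = 1
        targets.append(row)
    return targets


rng = random.Random(0)
cells = [make_cell(rng) for _ in range(4)]  # word, sentence, paragraph, document levels


def sentence(n_words):
    """A sentence as a list of random stand-in word vectors."""
    return [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(n_words)]


# One patient: documents in chronological order -> paragraphs -> sentences -> words.
patient = [
    [[sentence(3), sentence(2)]],
    [[sentence(2)], [sentence(4)]],
    [[sentence(1)]],
]
vocab = ["CODE_A", "CODE_B", "CODE_C"]  # placeholder billing/concept codes
codes_per_doc = [{"CODE_A"}, {"CODE_A", "CODE_C"}, {"CODE_B"}]

doc_vectors = [encode(doc, cells, 2) for doc in patient]
timeline_states = run_cell(cells[3], doc_vectors)  # document-level RNN over the timeline
targets = next_document_targets(codes_per_doc, vocab)
pairs = list(zip(timeline_states, targets))  # the last document has no future target
```

In a trained system, each state in `pairs` would be fed to a classifier whose loss is computed against the paired multi-hot target; here the weights are random, so only the shapes and the pairing are meaningful.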

Public Health Relevance

This project develops methods for extracting universal patient representations from unstructured text in electronic health records. These methods leverage huge amounts of clinical data, recurrent neural network architectures, and novel training techniques to incorporate information at multiple time scales. These methods are evaluated using public datasets to promote reproducibility, and applied to clinical research tasks that extend the knowledge of patients with pulmonary hypertension and autism spectrum disorder at Boston Children's Hospital.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Sim, Hua-Chuan
Boston Children's Hospital
United States