Electronic health records (EHRs) are now ubiquitous in routine cancer care delivery. The large volumes of data that EHRs contain could constitute an important resource for research and quality improvement, but to date, EHRs have not fully realized this potential. Important clinical endpoints, such as disease histology, stage, response, progression, and burden, are often recorded in the EHR only in unstructured free-text form. Even when structured data are available, they may be recorded only at one point in time, such as diagnosis, and may not be as relevant later in a patient's dynamic disease trajectory. These barriers prevent scalable analysis of EHR data for even relatively straightforward research tasks, such as identification of a cohort of patients potentially eligible for clinical trials. Identifying patients for trials is an important challenge in cancer research, since under 5% of adults with cancer have historically enrolled in therapeutic trials. Tools are in development to better match patients to trials, but no such tools are both publicly available and capable of incorporating time-specific patient phenotypes generated using unstructured EHR data. Recent rapid innovation in deep learning techniques could provide novel solutions to these challenges. In ongoing work, I have found that natural language processing based on a neural network architecture can reliably extract clinically relevant oncologic endpoints from free-text radiology reports. My goal is to develop an independent research program focused on leveraging such methods to put the EHR to use at scale for discovery and improving cancer care delivery.
My specific aims are (1) to develop and validate a clinically relevant, dynamic, pre-trained cancer trajectory model by applying deep learning to integrated structured and unstructured EHR data; (2) to apply transfer learning to a pre-trained cancer trajectory model to match patients to clinical trials using EHR data and clinical trial protocols; and (3) to pilot the incorporation of cancer trajectory modeling into an institutional clinical trial matching tool. In the near term, this work will facilitate accrual to clinical trials at our institution. During the independent research portion of the proposal, it will constitute the basis for a general framework for conducting scalable cancer research using EHR data.

Public Health Relevance

Electronic health records (EHRs) are now ubiquitous in routine cancer care delivery, but their utility for research and quality improvement has been limited by a dearth of methods for integrating the unstructured data in which most key cancer outcomes are encoded within EHRs. I propose to apply recent innovations in deep learning to integrate structured and unstructured data to create a dynamic pre-trained model of cancer patients' treatment trajectories, and to apply this model to identify patients who are appropriate candidates for clinical trials at times when they are eligible. I will then evaluate the effect of trajectory modeling on clinical trial accrual as I prepare for an independent research career focused on clinical cancer data science.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Career Transition Award (K99)
Study Section: Special Emphasis Panel (ZCA1)
Program Officer: Radaev, Sergey
Dana-Farber Cancer Institute
United States