Large electronic health record research (EHR) data integrated with -omics data from linked biorepositories have expanded opportunities for precision medicine research. These integrated datasets open opportunities for developing accurate EHR-based personalized cancer risk and progression prediction models, which can be easily incorporated into clinical practice and ultimately realize the promise of precision oncology. However, efficiently and effectively using EHR for cancer research remains challenging due to practical and methodological obstacles. For example, obtaining precise event time information such as time of cancer recurrence is a major bottleneck in using EHR for precision medicine research due to the requirement of laborious medical record review and the lack of documentation. Simple estimates of the event time based on billing or procedure codes may poorly approximate the true event time. Naive use of such estimated event times can lead to highly biased estimates due to the approximation error. Such biases impose challenges to performing pragmatic trials when the study endpoint is time to events and captured using EHR. The overall goal of this proposal is to fill these methodological gaps in risk assessment for cancer research using EHR data, which will advance our ability to achieve the promise of precision oncology. Statistical algorithms and software will be developed to (i) automatically assign event time information using longitudinally recorded EHR information; and (ii) to perform accurate risk assessment using noisy proxies of event times. The proposed tools for risk assessment using imperfect EHR data without requiring extensive manual chart review could greatly improve the utility of EHR for oncology research.
This proposal addresses major methodological gaps in effectively utilizing EHR data for risk assessment due to the noisy nature of EHR data. We propose novel statistical algorithms and software to (i) efficiently annotate event time by combining multiple longitudinally recorded code information and (ii) to provide precise risk estimates using noisy proxies of event times. By drawing upon the rich albeit imperfect data from bio-repository linked EHR, our algorithms will advance our ability to use EHR data for precision oncology research.