Since 2010, clinical medicine and public health have benefited from a rapid surge of clinical research on chronic diseases using data from electronic health records (EHRs). However, although large EHR networks include millions of patient records, these records are not population-representative random samples, a constraint that has limited their utility for population health research. The non-representative nature of EHR patient populations also poses a major challenge for cross-site validation of EHR-based findings, because study findings tend to reflect the unique characteristics of the populations served by specific health care systems. We propose to perform an integrated secondary data analysis of four unique datasets: 1) the Health and Retirement Study (HRS, begun in 1992 and ongoing), which has nationally representative health interview data spanning over 20 years, as well as biomarkers, physical assessment information, prescription drug data, and claims linkages including Medicare Part D drug claims; 2) the New York University Langone Health EHR data (NYU-CDRN, 2009 to present), including demographics, vitals, diagnoses, lab results, prescriptions, and procedures; 3) the New York City Clinical Data Research Network (NYC-CDRN), an EHR network that comprises 20 NYC healthcare institutions, including the NYU-CDRN, with longitudinally linked data on over 12 million patient encounters under a Common Data Model; and 4) the Veterans Affairs Ann Arbor Healthcare System (VAAAHS) Corporate Data Warehouse (CDW), which provides an important complement to the NYC-CDRN patient population when assessing our method's reproducibility and generalizability for the rural patient population in care. We will leverage these four datasets to address three strands of questions on EHR-based risk predictions: 1) assessing their utility for population inference, 2) developing individualized absolute risk predictions, and 3) assessing their reproducibility through cross-site validation.
We will predict the risk of subsequent incident cardiovascular disease (CVD) in older patients (age 50 and older) with type 2 diabetes (T2DM). The methods will be broadly applicable to other disease outcomes. To achieve these objectives, our study will 1) develop and validate EHR phenotyping and diagnosis time algorithms against gold-standard chart review (Aim 1); 2) assess the population generalizability of EHR-based risk estimation models by comparing them with cohort-based risk estimation models, and develop EHR bias adjustment methods for population inference (Aim 2); 3) develop methods for EHR-based individualized absolute risk prediction (Aim 3); and 4) establish the reproducibility of the developed methods via cross-site validation (Aim 4).
Our interdisciplinary team proposes to use novel approaches and innovative combinations of data to galvanize the use of EHR networks and population-based cohorts to advance EHR-based risk predictions for 1) population-level inference, 2) individualized predictions of absolute risk, and 3) improved reproducibility and cross-site validation.