The adoption of electronic health records (EHR) in routine healthcare has created a highly promising source of data for public health and medical research. Because EHR include rich data on large populations at relatively low cost, many researchers have turned to observational studies using EHR as an alternative to randomized studies, which are often prohibitively expensive and time-consuming to perform. However, EHR data are not collected for research purposes, and the potential for selection bias is high when analyses are restricted to patients with complete data. Standard methods to adjust for selection bias due to missing data, such as inverse probability weighting (IPW) and multiple imputation (MI), fail to address the complex nature of EHR data. Specifically, these methods tend to oversimplify the interplay of numerous decisions by patients, physicians, and insurers that collectively determine whether complete data are observed. One method for addressing selection bias due to missing data involves breaking down the complex process that governs whether or not a patient has complete data into a series of more manageable sub-mechanisms. This method involves characterizing the data provenance, or the process by which data appear in EHR. Statistical models can then be built for selection at each sub-mechanism to better reflect the true data provenance. A framework for estimation has been developed in which IPW is used to adjust for selection at every sub-mechanism. Since MI is generally more efficient than IPW, strategies for 'blended analyses' will be developed that simultaneously implement IPW and MI under the modularized specification. Estimation and inferential procedures under this framework will be established, and extensions to Rubin's rules for the variance of estimators that combine results across multiply imputed datasets in this framework will be derived.
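As a minimal sketch of the modularized IPW idea described above, assume that selection into the complete-data sample factors into K sequential sub-mechanisms, and that a conditional probability of passing each one has already been estimated for every patient (for example, by logistic regression). The function name `modular_ipw_mean`, the array layout, and the toy numbers are hypothetical illustrations and are not taken from the proposal's framework or software. The weight for a complete case is the inverse of the product of its stage probabilities, and the estimate uses a normalized (Hajek-type) weighted mean.

```python
import numpy as np


def modular_ipw_mean(y, observed, stage_probs):
    """Normalized IPW mean of y among complete cases.

    y           : outcomes, shape (n,); entries for incomplete cases are ignored
    observed    : boolean indicator of complete data, shape (n,)
    stage_probs : shape (n, K); column k holds the (assumed, pre-estimated)
                  probability of passing sub-mechanism k given that all
                  earlier sub-mechanisms were passed
    """
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    stage_probs = np.asarray(stage_probs, dtype=float)

    # Probability of complete data = product over the sub-mechanisms.
    p_complete = stage_probs.prod(axis=1)

    # Complete cases are weighted by the inverse of that probability.
    weights = 1.0 / p_complete[observed]
    return np.sum(weights * y[observed]) / np.sum(weights)


# Toy illustration with two hypothetical sub-mechanisms.
y = np.array([1.0, 2.0, np.nan, 4.0])
observed = np.array([True, True, False, True])
stage_probs = np.array([
    [0.5, 0.5],
    [1.0, 0.5],
    [0.5, 0.5],
    [1.0, 1.0],
])
estimate = modular_ipw_mean(y, observed, stage_probs)  # 12/7, about 1.714
print(estimate)
```

Modeling each column of `stage_probs` separately is what allows each sub-mechanism to depend on its own set of covariates; a single overall selection model would collapse that structure into one probability.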
IPW and MI fail to produce consistent estimates when data are missing not at random (MNAR); that is, when the probability that some covariate or outcome is measured depends on the value of that variable itself, or on other factors that are not completely measured in the EHR. Methods for sensitivity analyses will be developed to assess the extent to which estimators yielded by these methods are affected by such unobserved data. The methods described in these aims will be applied to EHR-derived data that include long-term health outcomes among 13,000 individuals with type 2 diabetes who underwent bariatric surgery between 1997 and 2013. Specifically, this research will answer open questions about the efficacy and safety of bariatric surgery in the treatment of patients with obesity and type 2 diabetes, and will consider how rates of micro- and macrovascular complications associated with diabetes differ between patients undergoing alternative surgical procedures. Robust software will be developed that provides researchers with valid, practical, and user-friendly tools for the identification, characterization, and control of selection bias in EHR-based research.
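To make the MNAR concern above concrete, the following sketch shows one common style of sensitivity analysis; it is not necessarily the approach this proposal will develop. It assumes a selection model in which the log-odds of being observed equal a fitted missing-at-random term plus `gamma` times the outcome, where `gamma` is an unidentifiable sensitivity parameter fixed by the analyst. The function names and toy values are hypothetical.

```python
import numpy as np


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))


def tilted_ipw_mean(y_obs, logit_p_mar, gamma):
    """Normalized IPW mean among complete cases under an assumed MNAR tilt.

    y_obs       : outcomes for complete cases only
    logit_p_mar : fitted log-odds of being observed under a MAR model
                  (assumed to be available for each complete case)
    gamma       : sensitivity parameter; gamma = 0 recovers the MAR weights
    """
    y_obs = np.asarray(y_obs, dtype=float)
    p = expit(np.asarray(logit_p_mar, dtype=float) + gamma * y_obs)
    w = 1.0 / p
    return np.sum(w * y_obs) / np.sum(w)


# Sweep gamma over a grid to see how far the estimate can move.
y_obs = np.array([0.0, 1.0, 2.0])
logit_p_mar = np.zeros(3)
for gamma in (-1.0, 0.0, 1.0):
    print(gamma, tilted_ipw_mean(y_obs, logit_p_mar, gamma))
```

If conclusions are stable across a plausible range of `gamma`, they are less likely to be driven by the untestable missing-at-random assumption.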

Public Health Relevance

Electronic health records (EHR) include rich data on large populations over long periods of time and are available at relatively low cost, but data in EHR are not collected for research purposes. Missing data are extremely common in EHR, and analyses that exclude patients on the basis of incomplete data are subject to selection bias. The focus of this proposal is the development of statistical methods to adjust for selection bias due to missing data in EHR-based research.

National Institutes of Health (NIH)
National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
Predoctoral Individual National Research Service Award (F31)
Study Section
Special Emphasis Panel (ZDK1)
Program Officer
Castle, Arthur
Harvard University
Biostatistics & Other Math Sci
Schools of Public Health
United States