Large existing healthcare databases, such as health insurance claims, Medicaid and Medicare claims, national and regional cancer registries, and electronic medical records, present both opportunities and challenges for comparative research in many medical areas, including cancer surveillance. Confounding bias arises from the observational nature of these databases. Moreover, missing data are common because these databases are collected for non-research purposes. Each problem has been studied extensively, but analytic approaches that tackle both in a unified manner are lacking. There is a critical need for novel statistical methods and software tools to bridge the gap between existing observational databases and the need for knowledge about comparative effectiveness.
Specific Aims: We propose to develop two methods for analyzing incomplete observational data. Both are novel applications of existing methods. The first, a multiply-robust method, will build on doubly-robust theory for causal inference and missing data models. The second, a tree-based imputation method, will integrate multiple imputation with tree-based, data-adaptive regression techniques for robust inference. We will evaluate and compare the performance of the new analytic methods via extensive simulation studies. We will also apply the methods to an existing breast cancer adherence study to compare the effects of two adjuvant hormone therapies on medication adherence rate. In addition, we propose to develop and document software programs to facilitate implementation of the proposed methods.
Research Design: The new methods will first be developed in simple settings with missing confounders only and then extended to more general settings with both missing confounders and missing outcomes. Throughout our methods development, we assume data are missing at random. We will consider various missing data patterns that are commonly observed in comparative studies using existing healthcare databases.
Impact: The potential impact of this project is significant because successful implementation of the proposed research will yield novel analytic methods and software tools that help investigators correctly and efficiently analyze existing observational databases with missing data and obtain valid comparative effectiveness and safety results. With the ongoing efforts to build nationwide electronic medical records systems, results from analyzing these secondary databases will help address many important public health and medical questions that, for ethical and practical reasons, cannot be addressed by randomized clinical trials (RCTs), or that would require much more time and resources to address via RCTs.
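For readers unfamiliar with the doubly-robust theory named in the Specific Aims, the sketch below shows a standard augmented inverse probability weighting (AIPW) estimator of an average treatment effect on simulated, fully observed data. It is an illustration only and is not the multiply-robust method proposed here: it does not handle missing confounders or outcomes. The function names (`fit_logistic`, `aipw_ate`), the simulated data, the continuous outcome, and the linear and logistic working models are all assumptions made for this example.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson.

    X must already contain an intercept column. Hypothetical helper
    written for this sketch.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def aipw_ate(x, a, y):
    """AIPW estimate of the average treatment effect E[Y(1)] - E[Y(0)].

    x: confounder(s), a: binary treatment indicator, y: outcome.
    The estimate is consistent if either the propensity model or the
    outcome model is correctly specified.
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x])

    # Working model 1: propensity score P(A = 1 | X), logistic regression.
    e = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, a)))

    # Working model 2: outcome regression E[Y | A, X], fit within each arm
    # by ordinary least squares.
    g1 = np.linalg.lstsq(X[a == 1], y[a == 1], rcond=None)[0]
    g0 = np.linalg.lstsq(X[a == 0], y[a == 0], rcond=None)[0]
    m1, m0 = X @ g1, X @ g0

    # Outcome-model prediction plus an inverse-probability-weighted
    # correction based on the observed residuals.
    psi1 = m1 + a * (y - m1) / e
    psi0 = m0 + (1 - a) * (y - m0) / (1 - e)
    return float(np.mean(psi1 - psi0))


# Simulated data (assumed for illustration): x confounds treatment and outcome.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
a = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.8 * x)))
y = 2.0 * a + 1.5 * x + rng.normal(size=n)  # true treatment effect is 2.0

naive = float(y[a == 1].mean() - y[a == 0].mean())  # ignores confounding
ate_hat = aipw_ate(x, a, y)
print(f"naive difference: {naive:.3f}, AIPW estimate: {ate_hat:.3f}")
```

In this simulation the unadjusted difference in means is biased by the confounder, while the AIPW estimate should fall close to the true effect because both working models are correctly specified. Extending the idea to incomplete data would require additional models for the missingness mechanism, which this sketch does not attempt.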

Public Health Relevance

We propose to develop unified, robust approaches to analyze existing observational databases with missing confounder and/or outcome data for comparative effectiveness research. These statistical methods will address the confounding bias and missing data issues in a unified manner. Successful completion of the proposed research will produce much-needed analytic methods and software tools to maximize the use of existing healthcare databases to address important public health and medical questions.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Exploratory/Developmental Grants (R21)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Mariotto, Angela B
Harvard Pilgrim Health Care, Inc.
United States
Shen, Changyu; Li, Xiaochun; Li, Lingling (2014) Inverse probability weighting for covariate adjustment in randomized studies. Stat Med 33:555-68
Shen, Changyu; Jeong, Jaesik; Li, Xiaochun et al. (2013) Treatment benefit and treatment harm rate to characterize heterogeneity in treatment effect. Biometrics 69:724-31