The applicant seeks to address the problem of missing values. A major challenge for biomedical research comes from missing values, which may arise for subjective reasons (e.g., nonresponse and dropout) or technical reasons (e.g., censoring above or below a quantization level). Generalized linear models (GLMs) and generalized linear mixed models (GLMMs) are widely applied in biomedical data analysis, where a fundamental task is to identify a subset of independent variables (e.g., genetic, proteomic, behavioral, or environmental factors) that explain or predict a dependent variable (e.g., therapeutic effectiveness and safety). Given an incomplete data set, practitioners may needlessly resort to case deletion, in which individuals are excluded from the analysis if any of the targeted variables is missing. This method not only sacrifices useful information but can also give rise to biased estimates, because it is valid only under strong assumptions about the missingness mechanism. A more satisfactory solution is multiple imputation, in which several imputations are created for the same set of missing values. Across multiply imputed data sets, however, traditional variable selection methods (based on significance tests or likelihood criteria) often select different predictors, which raises the problem of how to combine the resulting models for final inference. In this R01 proposal, we aim to develop alternative strategies of variable selection for GLMs with missing values by drawing on a Bayesian framework. The first strategy, "impute, then select" (ITS), performs multiple imputation first and then applies Bayesian variable selection to the multiply imputed data sets. The second strategy, "simultaneously impute and select" (SIAS), conducts Bayesian variable selection and missing data imputation simultaneously within a single Markov chain Monte Carlo (MCMC) process.
ITS and SIAS offer two generic frameworks within which various Bayesian variable selection algorithms and missing data imputation algorithms can be implemented. The strategies will be extended to handle complex data sets, such as those with multi-level design structures and/or a large number of variables. The strategies will be developed, evaluated, and implemented in an R library for normal, binomial/multinomial, and Poisson regression models with mixed categorical and continuous explanatory variables. Simulated and practical data sets from studies on childhood autism and drug dependence will be used to assess the effectiveness and flexibility of the proposed strategies.
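
To make the "impute, then select" idea concrete, the following is a minimal sketch for a normal linear model, not the proposal's actual algorithm. It rests on several assumptions that do not come from the abstract: missing covariates are filled by stochastic regression imputation (a simplification of proper multiple imputation, since the regression parameters are not redrawn), posterior model probabilities are approximated by BIC weights under a uniform prior over models, and results are combined by averaging posterior inclusion probabilities across imputations. All function names are hypothetical.

```python
import itertools
import numpy as np

def impute_once(X, y, rng):
    """Fill NaNs in each incomplete column with a noisy regression prediction
    based on the complete columns and y (stochastic regression imputation)."""
    X = X.copy()
    complete = ~np.isnan(X).any(axis=0)
    Z = np.column_stack([np.ones(len(y)), X[:, complete], y])
    for j in np.where(~complete)[0]:
        miss = np.isnan(X[:, j])
        coef, *_ = np.linalg.lstsq(Z[~miss], X[~miss, j], rcond=None)
        resid = X[~miss, j] - Z[~miss] @ coef
        sigma = resid.std(ddof=Z.shape[1])
        X[miss, j] = Z[miss] @ coef + rng.normal(0.0, sigma, miss.sum())
    return X

def model_posteriors(X, y):
    """Enumerate all predictor subsets and return BIC-based approximate
    posterior model probabilities (uniform prior over models assumed)."""
    n, p = X.shape
    models, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            D = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(D, y, rcond=None)
            rss = np.sum((y - D @ beta) ** 2)
            bics.append(n * np.log(rss / n) + D.shape[1] * np.log(n))
            models.append(subset)
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))
    return models, weights / weights.sum()

def impute_then_select(X, y, n_imputations=5, seed=0):
    """Average posterior inclusion probabilities over imputed data sets."""
    rng = np.random.default_rng(seed)
    inclusion = np.zeros(X.shape[1])
    for _ in range(n_imputations):
        X_completed = impute_once(X, y, rng)
        models, probs = model_posteriors(X_completed, y)
        for subset, prob in zip(models, probs):
            for j in subset:
                inclusion[j] += prob
    return inclusion / n_imputations

# Simulated example: y depends on columns 0 and 1 only; column 1 has
# roughly 20% of its values missing completely at random.
rng = np.random.default_rng(1)
n = 200
X_full = rng.normal(size=(n, 4))
y = 2.0 * X_full[:, 0] - 1.5 * X_full[:, 1] + rng.normal(size=n)
X_obs = X_full.copy()
X_obs[rng.random(n) < 0.2, 1] = np.nan

pip = impute_then_select(X_obs, y)
print(np.round(pip, 3))
```

Averaging inclusion probabilities yields a single summary across imputations, which avoids having to reconcile different selected models one by one. Exhaustive enumeration is feasible only for a handful of predictors; a larger problem would need a stochastic search over models instead.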

Public Health Relevance

Missing data are the norm when large data sets are assembled. The issue comes to the forefront when such data sets are used to develop personalized and individualized care. To avoid the loss of information caused by discarding incomplete records and to provide better predictions of risk and benefit, an imputation-based Bayesian variable selection strategy provides a powerful analytical tool. The availability of our new method and software package will enhance the capacity and quality of medical research and healthcare delivery.

National Institutes of Health (NIH)
Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-AARR-F (52))
Program Officer
King, Rosalind B
University of California Davis
Public Health & Prev Medicine
Schools of Medicine
United States
Kim, Soeun; Sugar, Catherine A; Belin, Thomas R (2015) Evaluating model-based imputation methods for missing covariates in regression models with interactions. Stat Med 34:1876-88
Zhang, Xiaoshuai; Xue, Fuzhong; Liu, Hong et al. (2014) Integrative Bayesian variable selection with gene-based informative priors for genome-wide association studies. BMC Genet 15:130
Peng, Bin; Zhu, Dianwen; Ander, Bradley P et al. (2013) An integrative framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLoS One 8:e67672
Zhang, Xiaoshuai; Yang, Xiaowei; Yuan, Zhongshang et al. (2013) A PLSPM-based test statistic for detecting gene-gene co-association in genome-wide association study with case-control design. PLoS One 8:e62129