In medical research, especially research on behavioral and social problems, missing values pose a major challenge for statistical data analysis. Missing values may arise for subjective reasons (e.g., nonresponse and dropout) or technical reasons (e.g., censoring above or below a quantization level). Generalized linear models (GLMs) are widely applied in biomedical data analysis, where a fundamental task is to explain or predict an outcome variable from a subset of potentially explanatory variables. Given an incomplete data set, practitioners frequently resort to case-deletion, in which individuals are excluded from consideration if they are missing any of the variables targeted for analysis. This is the default option in many software packages. Yet case-deletion may not only sacrifice useful information but also give rise to biased estimates, because it requires strong assumptions about the missingness mechanism. A more satisfactory solution for missing data problems is multiple imputation, in which several imputations are created for the same set of missing values. The variance between imputations reflects the uncertainty due to missingness. Across multiply imputed data sets, however, traditional variable selection methods (based on significance tests or various selection criteria) often yield models with different selected predictors, which raises the problem of how to combine the models to make final inferences.
In this R01 proposal with a 3-year research plan, we aim to develop two alternative strategies of variable selection for GLMs with missing values by drawing on a Bayesian framework. The first strategy, which we call "impute, then select" (ITS), involves first performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. The second strategy, "simultaneously impute and select" (SIAS), conducts Bayesian variable selection and missing data imputation simultaneously within one Markov chain Monte Carlo (MCMC) process. ITS and SIAS offer two generic frameworks within which various Bayesian variable selection algorithms and missing data imputation algorithms can be implemented. Both strategies will be developed, evaluated, and implemented in an R library for normal regression, binomial regression, and other GLMs with categorical and/or continuous explanatory variables. Real data sets from several studies on substance abuse and childhood autism will be used to assess the effectiveness and flexibility of the proposed strategies. Developing these procedures and making the software available to statisticians and medical researchers would significantly improve the quality of evaluation of important and clinically relevant data.
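
As a rough illustration of the "impute, then select" idea described above, the following Python sketch fills in missing predictor values several times, scores every candidate model on each completed data set, and averages the resulting inclusion probabilities across imputations. It is a toy under stated assumptions and not the proposal's method: the imputation step draws from each column's observed marginal normal fit (a crude stand-in for a proper imputation model), the selection step uses BIC weights as an approximation to posterior model probabilities rather than an MCMC-based Bayesian variable selection algorithm, and all function names (`impute_once`, `bic_inclusion_probs`, `impute_then_select`, `simulate_data`) are hypothetical.

```python
import itertools
import numpy as np


def impute_once(X, rng):
    """Fill each missing entry with a random draw from a normal distribution
    fitted to the observed values of the same column (assumed, crude imputer)."""
    X_imp = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        X_imp[miss, j] = rng.normal(obs.mean(), obs.std(ddof=1), size=miss.sum())
    return X_imp


def bic_inclusion_probs(X, y):
    """Approximate posterior inclusion probabilities for a normal linear model
    by BIC weights over all subsets of predictors (feasible only for small p)."""
    n, p = X.shape
    models, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ beta) ** 2)
            bics.append(n * np.log(rss / n) + design.shape[1] * np.log(n))
            models.append(subset)
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))  # subtract min for stability
    weights /= weights.sum()
    probs = np.zeros(p)
    for w, subset in zip(weights, models):
        for j in subset:
            probs[j] += w
    return probs


def impute_then_select(X, y, n_imputations=5, seed=0):
    """Impute several times, select on each completed data set, then average
    the inclusion probabilities across imputations."""
    rng = np.random.default_rng(seed)
    per_imputation = [
        bic_inclusion_probs(impute_once(X, rng), y) for _ in range(n_imputations)
    ]
    return np.mean(per_imputation, axis=0)


def simulate_data(n=200, p=4, missing_rate=0.15, seed=1):
    """Synthetic example: only the first two predictors affect the outcome,
    and predictor values go missing completely at random."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
    X_missing = X.copy()
    X_missing[rng.random((n, p)) < missing_rate] = np.nan
    return X_missing, y


if __name__ == "__main__":
    X_missing, y = simulate_data()
    print(impute_then_select(X_missing, y))
```

Averaging inclusion probabilities is only one simple way to pool selection results across imputed data sets; it sidesteps the problem of different predictors being selected in different imputations, but it does not carry the between-imputation uncertainty that a full Bayesian treatment would.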

Public Health Relevance

Variable selection in generalized linear models (GLMs) is a fundamental task, and missing values are common in biomedical research. The proposed method of Bayesian variable selection within multiple imputation overcomes the limitations of traditional variable selection methods, especially in handling missing values. The completed methodology and software will provide the research community with powerful statistical tools to enhance the quality of medical research.

National Institutes of Health (NIH)
Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-AARR-F (52))
Program Officer
King, Rosalind B
Hunter College
Schools of Public Health
New York
United States