In medical research, especially research on behavioral and social problems, missing values pose a major challenge for statistical data analysis. Missing values may arise for subjective reasons (e.g., nonresponse and dropout) or technical reasons (e.g., censoring above or below a quantization level). Generalized linear models (GLMs) are widely applied in biomedical data analysis, where a fundamental task is to interpret or predict an outcome variable using a subset of potentially explanatory variables. Given an incomplete data set, practitioners frequently resort to case-deletion, in which individuals are excluded from consideration if they are missing any of the variables targeted for analysis. This is the default option in many software packages. Yet case-deletion may not only sacrifice useful information but also give rise to biased estimates, because it requires strong assumptions about the missingness mechanism. A more satisfactory solution to missing data problems is multiple imputation, in which several imputations are created for the same set of missing values. The variance between imputations reflects the uncertainty due to missingness. Across multiply imputed data sets, however, traditional variable selection methods (based on significance tests or various criteria) often yield models with different selected predictors, which raises the problem of how to combine the models to make final inferences. In this R01 proposal with a 3-year research plan, we aim to develop two alternative strategies of variable selection for GLMs with missing values by drawing on a Bayesian framework. The first strategy, which we call "impute, then select" (ITS), involves initially performing multiple imputation and then applying Bayesian variable selection to the multiply imputed data sets. The second strategy, "simultaneously impute and select" (SIAS), conducts Bayesian variable selection and missing data imputation simultaneously within one Markov chain Monte Carlo (MCMC) process. ITS and SIAS offer two generic frameworks within which various Bayesian variable selection algorithms and missing data imputation algorithms can be implemented. Both strategies will be developed, evaluated, and implemented in an R library for normal regression, binomial regression, and other GLMs with categorical and/or continuous explanatory variables. Practical data sets from several studies on substance abuse and childhood autism will be used to assess the effectiveness and flexibility of the proposed strategies. Developing these procedures and making the software available to statisticians and researchers in medical research would significantly improve the quality of evaluation of important and clinically relevant data.
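To make the "impute, then select" idea concrete, the following is a minimal sketch in Python, not the proposal's own algorithm or its R library. It rests on several simplifying assumptions: a normal linear model, missing values in a single covariate, a crude stochastic regression imputation that does not draw the imputation parameters from their posterior, and a BIC approximation to posterior model probabilities in place of a full Bayesian variable selection sampler. All function names (`impute_column0`, `bic_inclusion_probs`, `impute_then_select`) and the simulated data are hypothetical.

```python
import itertools
import numpy as np


def impute_column0(X, y, rng):
    """Stochastic regression imputation for column 0 only (assumption:
    the other columns and y are fully observed)."""
    miss = np.isnan(X[:, 0])
    # Predict column 0 from an intercept, the other covariates, and the outcome.
    Z = np.column_stack([np.ones(len(y)), X[:, 1:], y])
    beta, *_ = np.linalg.lstsq(Z[~miss], X[~miss, 0], rcond=None)
    resid = X[~miss, 0] - Z[~miss] @ beta
    sigma = resid.std(ddof=Z.shape[1])
    X_imp = X.copy()
    # Add residual noise so that repeated imputations differ from each other.
    X_imp[miss, 0] = Z[miss] @ beta + rng.normal(0.0, sigma, size=miss.sum())
    return X_imp


def bic_inclusion_probs(X, y):
    """Approximate posterior inclusion probabilities by enumerating all
    covariate subsets and weighting each model by exp(-BIC / 2)."""
    n, p = X.shape
    subsets, bics = [], []
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta, *_ = np.linalg.lstsq(design, y, rcond=None)
            rss = np.sum((y - design @ beta) ** 2)
            bics.append(n * np.log(rss / n) + design.shape[1] * np.log(n))
            subsets.append(subset)
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))
    weights /= weights.sum()
    probs = np.zeros(p)
    for subset, w in zip(subsets, weights):
        for j in subset:
            probs[j] += w
    return probs


def impute_then_select(X, y, n_imputations=10, seed=0):
    """Impute several times, select on each completed data set, then
    summarize inclusion probabilities across imputations."""
    rng = np.random.default_rng(seed)
    per_imputation = np.array(
        [bic_inclusion_probs(impute_column0(X, y, rng), y)
         for _ in range(n_imputations)]
    )
    # Mean: combined evidence for each predictor.
    # Variance: how much the selection result moves between imputations.
    return per_imputation.mean(axis=0), per_imputation.var(axis=0, ddof=1)


# Simulated example: predictors 0 and 2 are truly active, 1 and 3 are noise.
rng = np.random.default_rng(42)
n = 300
X_full = rng.normal(size=(n, 4))
y = 1.0 + 1.5 * X_full[:, 0] - 1.0 * X_full[:, 2] + rng.normal(size=n)
X_obs = X_full.copy()
X_obs[rng.random(n) < 0.3, 0] = np.nan  # about 30% missing in column 0

mean_pip, var_pip = impute_then_select(X_obs, y)
print("mean inclusion probability:", np.round(mean_pip, 3))
print("between-imputation variance:", np.round(var_pip, 5))
```

Averaging inclusion probabilities sidesteps the problem of reconciling different selected models across imputed data sets, since each predictor receives a single combined score. A SIAS-style approach would instead update the missing values and the model indicators inside the same MCMC loop, which this sketch does not attempt.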

Public Health Relevance

Variable selection in generalized linear models (GLMs) is a fundamental task, and missing values are common in biomedical research. The proposed method of Bayesian variable selection within multiple imputation overcomes the limitations of traditional variable selection methods, especially in handling missing values. The resulting methodology and software will provide the research community with powerful statistical tools to enhance the quality of medical research.

Agency
National Institutes of Health (NIH)
Institute
Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD)
Type
Research Project (R01)
Project #
7R01HD061404-03
Application #
8543193
Study Section
Special Emphasis Panel (ZRG1-AARR-F (52))
Program Officer
King, Rosalind B
Project Start
2011-08-11
Project End
2014-04-30
Budget Start
2012-08-28
Budget End
2013-04-30
Support Year
3
Fiscal Year
2012
Total Cost
$95,377
Indirect Cost
$33,039
Name
Hunter College
Department
Type
Schools of Public Health
DUNS #
620127915
City
New York
State
NY
Country
United States
Zip Code
10065
Kim, Soeun; Belin, Thomas R; Sugar, Catherine A (2016) Multiple imputation with non-additively related variables: Joint-modeling and approximations. Stat Methods Med Res :
Kim, Soeun; Sugar, Catherine A; Belin, Thomas R (2015) Evaluating model-based imputation methods for missing covariates in regression models with interactions. Stat Med 34:1876-88
Zhang, Xiaoshuai; Xue, Fuzhong; Liu, Hong et al. (2014) Integrative Bayesian variable selection with gene-based informative priors for genome-wide association studies. BMC Genet 15:130
Peng, Bin; Zhu, Dianwen; Ander, Bradley P et al. (2013) An integrative framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLoS One 8:e67672
Zhang, Xiaoshuai; Yang, Xiaowei; Yuan, Zhongshang et al. (2013) A PLSPM-based test statistic for detecting gene-gene co-association in genome-wide association study with case-control design. PLoS One 8:e62129