The investigator seeks to expand the theory and application of rescaled spike and slab models, a class of Bayesian models, to address the general problem of variable selection and prediction. This will be accomplished through three distinct aims: (1) developing theory and fast computational algorithms for non-orthogonal designs by making use of spike and slab orthogonalization. The resulting predictor, a bagged ensemble derived using generalized ridge regression, will be shown to possess state-of-the-art predictive performance when interpretability is factored in alongside black-box prediction. Theory, in the form of finite sample arguments, will show that this is due to selective shrinkage, a property whereby only truly zero coefficients are shrunk towards zero; (2) developing general methodology for hard thresholding estimated regression coefficients; and (3) extending the rescaled spike and slab framework to non-linear models, such as generalized linear models and non-proportional survival regression models with time-dependent predictors.
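
As an illustration of the ideas named in the aims, the following is a minimal sketch of generalized ridge regression (one penalty per coefficient) followed by hard thresholding, on a small non-orthogonal design. It is not the investigator's algorithm: the toy data, the function names `solve`, `generalized_ridge`, and `hard_threshold`, the hand-picked penalties, and the cutoff of 0.1 are all assumptions made for this example. In particular, the penalties are supplied by hand as if the zero coefficients were known, whereas a spike and slab model would have to estimate them from the data; bagging is not shown.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def generalized_ridge(x, y, penalties):
    """Minimize ||y - x b||^2 + sum_k penalties[k] * b[k]^2."""
    p = len(x[0])
    # Normal equations: (X'X + diag(penalties)) b = X'y.
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)]
           for i in range(p)]
    for k in range(p):
        xtx[k][k] += penalties[k]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)


def hard_threshold(beta, cutoff):
    """Set coefficients with absolute value at or below cutoff to zero."""
    return [b if abs(b) > cutoff else 0.0 for b in beta]


# Toy non-orthogonal design (hypothetical data): only the first
# coefficient is truly non-zero, with true value 2.
X = [[1.0, 0.5, 0.0], [1.0, -0.5, 1.0], [1.0, 1.0, -1.0],
     [-1.0, 0.5, 1.0], [-1.0, -1.0, 0.0], [-1.0, 0.0, -1.0]]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
y = [2.0 * row[0] + e for row, e in zip(X, noise)]

ols = generalized_ridge(X, y, [0.0, 0.0, 0.0])        # no shrinkage
uniform = generalized_ridge(X, y, [6.0, 6.0, 6.0])    # ordinary ridge
selective = generalized_ridge(X, y, [0.0, 1e6, 1e6])  # penalize only the zeros
sparse = hard_threshold(ols, 0.1)

print("least squares  :", ols)
print("uniform ridge  :", uniform)
print("selective ridge:", selective)
print("thresholded    :", sparse)
```

With a single shared penalty, the non-zero coefficient is pulled towards zero along with the others. With coefficient-specific penalties, only the coefficients that are truly zero are shrunk, which is the behavior the abstract calls selective shrinkage.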

Intellectually, this research will enhance our understanding of model building and outcome prediction, especially in ill-determined settings where the sample size is on the order of, or dominated by, the number of predictors (variables). Such settings are becoming all too common in scientific research. Among the applications considered will be colon cancer genomics; colon cancer is an important public health problem. Currently, colorectal cancer is the second leading cause of cancer mortality in the adult American population, accounting for 140,000 new cases and 60,000 deaths annually. Although widely used, the current classification scheme is known to be highly imperfect in reflecting the actual underlying molecular determinants of colon cancer behavior. For instance, upwards of 20% of patients whose cancers metastasize to the liver are not given life-saving adjuvant chemotherapy based on the current clinical staging system. Thus, there is an important need to identify a molecular signature that will distinguish tumors that metastasize. Another area of application will be long-term models for predicting outcomes following coronary artery bypass surgery, a widely used surgical modality for patients with obstructive coronary artery disease. Current long-term prediction models have serious limitations which have hindered our understanding. Yet another application will be in understanding the survival behavior of heart and lung transplant recipients and the role viruses play in potential dysfunction of the transplanted organs. Methodology will be complemented by the development of software for fast computational solutions in high-dimensional settings.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
0705037
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2007-08-01
Budget End
2011-07-31
Support Year
Fiscal Year
2007
Total Cost
$159,995
Indirect Cost
Name
Cleveland Clinic Lerner
Department
Type
DUNS #
City
Cleveland
State
OH
Country
United States
Zip Code
44195