In this project the investigators will develop two novel, related approaches to variable selection in regression models. Both procedures use multiple Monte Carlo-generated pseudo data sets to tune the controlling parameter alpha of a standard variable-selection routine (e.g., alpha = "alpha-to-enter" in forward selection).

In the first approach, white noise is added to the response variable, with variance set to a controlled multiple, m, of the full-model mean squared error (FMMSE). The variable-selection process is run on the noise-enhanced data, and the selected model's mean squared error (MSE) is computed for each value of the tuning parameter alpha. This process is repeated over many bootstrap-type replications, and the average MSE is retained for each value of m. The optimal alpha is the value that yields, on average, the theoretically expected mean squared error for the noise-enhanced data, FMMSE(1 + m). This approach applies broadly to variable-selection procedures used with additive-error regression models.

In the second approach, phony predictors are added to the data set, and the proportion of phony variables included in the selected model is estimated for each value of the tuning parameter. Averaging over bootstrap-type replications then yields an estimate of the false selection rate (FSR) of the process on the observed data for each value of alpha, and the FSR is controlled through the choice of alpha. The FSR is a readily interpretable and meaningful quantity to control. Because this method is not restricted to additive-error models, it has wider applicability than the noise-enhancement approach.
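The phony-predictor idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the investigators' implementation: it assumes a simple forward-selection routine driven by partial F-test p-values (the "alpha-to-enter" rule), generates phony predictors by independently permuting the rows of randomly chosen real columns (one plausible way to break their association with the response), and estimates the FSR as the average fraction of selected variables that are phony. The function names (`forward_select`, `estimate_fsr`) and all parameter choices are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def forward_select(X, y, alpha):
    """Greedy forward selection with an 'alpha-to-enter' rule: at each
    step, add the candidate whose partial F-test p-value is smallest,
    stopping when no candidate's p-value is below alpha."""
    n = len(y)
    selected, remaining = [], list(range(X.shape[1]))
    rss = float(np.sum((y - y.mean()) ** 2))  # intercept-only RSS
    while remaining:
        best = None
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])
            resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
            rss_j = float(resid @ resid)
            df = n - len(cols) - 1
            F = (rss - rss_j) / (rss_j / df)  # partial F for adding j
            p = stats.f.sf(F, 1, df)
            if best is None or p < best[0]:
                best = (p, j, rss_j)
        if best[0] >= alpha:
            break
        _, j, rss = best
        selected.append(j)
        remaining.remove(j)
    return selected

def estimate_fsr(X, y, alpha, n_phony=5, n_rep=20):
    """Monte Carlo estimate of the false selection rate: append n_phony
    row-permuted copies of randomly chosen real columns, run forward
    selection, and average the fraction of selected variables that
    are phony."""
    n, k = X.shape
    rates = []
    for _ in range(n_rep):
        phony = rng.permuted(X[:, rng.integers(0, k, n_phony)], axis=0)
        sel = forward_select(np.column_stack([X, phony]), y, alpha)
        n_fake = sum(j >= k for j in sel)  # phony columns sit after the real ones
        rates.append(n_fake / max(len(sel), 1))
    return float(np.mean(rates))

# Example: 6 real predictors, only the first two matter.
n, k = 100, 6
X = rng.standard_normal((n, k))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(n)

fsr_strict = estimate_fsr(X, y, alpha=0.001)  # stringent entry rule
fsr_loose = estimate_fsr(X, y, alpha=0.5)     # permissive entry rule
```

Sweeping alpha over a grid and picking the value whose estimated FSR matches a target rate is the tuning step the abstract describes; a permissive alpha admits more phony variables and so yields a higher estimated FSR than a stringent one.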

Regression modeling is the most widely used statistical procedure. Statisticians have long known that the choice of predictor variables is the most important component of a regression analysis, yet the identification of important predictors remains one of the least understood and most important open problems in statistical inference. This is true for small to moderate data sets with a handful of potential predictors, as well as for the huge data sets, with potential predictors numbering in the thousands, that are becoming more prevalent in statistical applications. In this project the investigators develop methods for identifying important predictor variables from a larger set of potential predictors. The impact of the research is as broad as the application of regression modeling itself: the new methods will enable researchers in all application fields to better fit regression models to data sets both small and large. The range of applications is enormous and includes, for example, genetic microarray data, drug-development data, census bureau data, financial data such as credit card transactions or loan applications, large weather and environmental data sets, and electric power usage data.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
0504283
Program Officer
Gabor J. Szekely
Budget Start
2005-07-01
Budget End
2009-06-30
Fiscal Year
2005
Total Cost
$300,000
Name
North Carolina State University Raleigh
City
Raleigh
State
NC
Country
United States
Zip Code
27695