The project aims to develop effective penalization methods for screening, dimension reduction, and variable selection in high-dimensional regression. The investigators focus mainly on multiple index models, because this type of model combines the strengths of linear and nonparametric regression while avoiding their drawbacks. A novel penalization approach is employed for model fitting, which regularizes both the parametric and nonparametric components of a multiple index model. A pilot study indicates that this approach performs better than existing ones. When facing ultra-high dimensionality, the investigators use a forward variable screening procedure to reduce the dimension to a manageable size before applying the proposed penalization. The investigators plan to study the theoretical properties of this approach and develop fast and efficient computing algorithms for its implementation. The proposed approach is further extended to applications involving categorical responses or random effects.

Advances in science and technology have led to an explosive growth of massive data in areas such as bioinformatics, climate research, and the internet. Traditional statistical methods for clustering, regression, and classification become ineffective when dealing with a large number of variables. Lately, a tremendous amount of research effort has been dedicated to developing statistical methods, such as dimension reduction and variable selection, for analyzing this type of massive data. The investigators join the effort by proposing a novel penalization approach and developing efficient computing algorithms. The results from this project not only advance statistical research but also help other scientists and researchers better understand and analyze their massive data, and thereby advance scientific discovery.

Project Report

This research project focuses on the development of penalty-based methods for dimension reduction and variable selection in high-dimensional regression analysis. In particular, it focuses on developing methods under the assumption of single and multiple index models. Index models form a special family of semiparametric models, each consisting of a parametric component (i.e., the indices) and a lower-dimensional nonparametric component (i.e., the link functions). They can be considered compromises between linear regression models and fully nonparametric models. Index models are used to model the relationship between a response variable and a vector of explanatory variables, and are often used to facilitate dimension reduction and variable selection. When the number of explanatory variables is large, fitting index models becomes challenging due to the curse of dimensionality, so penalty-based regularization methods are needed to make the fitting efficient and stable in high dimensions. In this project, absolute value penalty (i.e., lasso-type penalty) functions are used for index model regularization. For multiple index models, instead of penalizing only the indices, a novel penalty function that penalizes both gradients and indices has been proposed, and algorithms under this penalty function have been developed. It has been shown that the proposed method can lead to results that are more efficient both statistically and computationally. For single index models, in order to further improve computational efficiency, regression splines are used to estimate the nonparametric link function while the index is penalized using the lasso penalty function or a general lasso-type penalty function. Different constraints for identifiability have also been explored. The proposed approach leads to more efficient algorithms and achieves better variable selection results in high-dimensional spaces than other existing methods.
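For reference, the models described above can be written in generic notation. The symbols below (g, the index vectors, the spline basis functions B_k with coefficients, and the tuning parameter) are assumed here for illustration and do not appear in the report. The second display is only a generic lasso-penalized spline criterion for the single index case with a unit-norm identifiability constraint; the project's penalty on both gradients and indices is not specified in this report and is not reproduced.

```latex
% Multiple index model: d indices, with d much smaller than p
\[
  y = g\!\left(\beta_1^{\top} x, \ldots, \beta_d^{\top} x\right) + \varepsilon,
  \qquad x \in \mathbb{R}^{p},\; d \ll p .
\]

% Single index case (d = 1): regression-spline link, lasso penalty on the index
\[
  \min_{\beta,\,\gamma}\;
  \frac{1}{2n} \sum_{i=1}^{n}
  \Bigl( y_i - \sum_{k=1}^{K} \gamma_k B_k\!\left(\beta^{\top} x_i\right) \Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
  \quad \text{subject to } \lVert \beta \rVert_2 = 1 .
\]
```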
The conditions under which the proposed methods are consistent for variable selection under single index models have also been obtained. The proposed methods have been implemented in R packages, which are made available to the public. They can be used by data analysts for high dimensional regression analysis, especially when the linear model cannot be assumed for the relationship between the response variable and the explanatory variables. This research project not only helps to advance the statistical theory and methodology for high dimensional semiparametric regression analysis but also helps to develop effective tools that benefit researchers in a variety of areas.
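To illustrate the single index approach described in this report (a regression spline for the link function, a lasso penalty on the index), a minimal Python sketch follows. It is not the investigators' algorithm and does not use their R packages. Every function name is hypothetical, and the cubic truncated-power basis, the linearized index update, the coordinate descent solver, the unit-norm identifiability constraint, and the simulated data are all assumptions chosen to keep the example short.

```python
import numpy as np

def spline_basis(u, knots):
    """Cubic truncated-power basis B(u) and its derivative dB/du (assumed basis)."""
    plus = [np.clip(u - k, 0.0, None) for k in knots]
    B = np.column_stack([np.ones_like(u), u, u**2, u**3] + [t**3 for t in plus])
    dB = np.column_stack([np.zeros_like(u), np.ones_like(u), 2 * u, 3 * u**2]
                         + [3 * t**2 for t in plus])
    return B, dB

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(Z, y, lam, b_init, n_sweeps=100):
    """Coordinate descent for (1/2n)||y - Z b||^2 + lam * ||b||_1."""
    n, p = Z.shape
    b = b_init.copy()
    col_sq = (Z**2).sum(axis=0) / n
    r = y - Z @ b                       # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = Z[:, j] @ r / n + col_sq[j] * b[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            r += Z[:, j] * (b[j] - new)  # keep residual in sync with b
            b[j] = new
    return b

def fit_single_index(X, y, lam, n_knots=4, n_iter=20):
    """Alternate a spline fit of the link with a lasso update of the index."""
    beta = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]  # crude starting direction
    beta /= np.linalg.norm(beta)
    for _ in range(n_iter):
        u = X @ beta
        knots = np.quantile(u, np.linspace(0, 1, n_knots + 2)[1:-1])
        B, dB = spline_basis(u, knots)
        coef = np.linalg.lstsq(B, y, rcond=None)[0]   # regression-spline link
        g, dg = B @ coef, dB @ coef
        Z = X * dg[:, None]                           # linearize g(X b) around beta
        new = lasso_cd(Z, y - g + Z @ beta, lam, beta)
        norm = np.linalg.norm(new)
        if norm == 0.0:                               # penalty removed every variable
            break
        new /= norm                                   # identifiability: unit norm
        if new[np.argmax(np.abs(new))] < 0:           # identifiability: fix the sign
            new = -new
        beta = new
    return beta

# Simulated example (assumed data): 3 active variables out of 20, sine link.
rng = np.random.default_rng(0)
n, p = 400, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -2.0, 1.0]
beta_true /= np.linalg.norm(beta_true)
y = 2.0 * np.sin(X @ beta_true) + 0.2 * rng.standard_normal(n)

beta_hat = fit_single_index(X, y, lam=0.1)
print("selected variables:", np.flatnonzero(beta_hat))
print("|cosine| with true index:", abs(beta_hat @ beta_true))
```

The index is only identified up to scale and sign, which is why the sketch renormalizes after every update; other constraints would work equally well here. The penalty level `lam` is fixed by hand, whereas in practice it would be chosen by a data-driven criterion such as cross-validation.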

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1107047
Program Officer: Gabor Szekely
Budget Start: 2011-06-15
Budget End: 2014-05-31
Fiscal Year: 2011
Total Cost: $50,000
Name: Purdue University
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907