With high-dimensional data, parsimonious models are preferred because they are easier to interpret and often reduce prediction error. Regularization is an essential component of most modern data analysis methods, particularly when the number of predictors is large: without it, fitted models tend to be badly over-fitted and of little practical use. The investigators take a regularization approach to variable selection in high-dimensional statistical modeling, so that the resulting model combines high prediction accuracy with a sparse representation. In particular, the investigators develop: (1) new fused variable selection methods for the analysis of proteomics data, which has become a revolutionary tool for cancer diagnosis; (2) a novel kernel logistic regression model that automatically adopts a support-vector representation; (3) several new techniques for simultaneous variable selection when estimating multiple quantile regression functions. The investigators also study the theoretical properties of these new variable selection techniques. Efficient algorithms and software are developed for public use.
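To illustrate how a regularization penalty can yield a sparse model, the following is a minimal sketch of the standard lasso (an L1 penalty on a least-squares fit) solved by cyclic coordinate descent. It is not one of the methods developed under this award (the fused, kernel logistic, and quantile regression techniques are not reproduced here); the function names, the toy data, and the penalty value `lam = 0.5` are all assumptions chosen for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes j's own fit.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy data (hypothetical): three mutually orthogonal predictors, where the
# response depends strongly on the first and only weakly on the second.
X = [[1.0, 1.0, 1.0],
     [1.0, -1.0, -1.0],
     [-1.0, 1.0, -1.0],
     [-1.0, -1.0, 1.0]]
y = [2.1, 1.9, -1.9, -2.1]  # y = 2.0 * x0 + 0.1 * x1

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)  # first coefficient shrunk to about 1.5; the other two are exactly 0.0
```

The soft-thresholding step is what produces exact zeros: the weak second predictor and the irrelevant third one are dropped from the model, while the strong first predictor is kept but shrunk. That trade-off between sparsity and shrinkage is controlled by `lam`.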

Modern scientific innovations allow scientists to collect massive, high-dimensional data sets. Extracting useful information from such large volumes of data is critical to scientific investigation. For this reason, variable selection and dimension reduction play a fundamental role in high-dimensional statistical modeling. Variable selection problems arise in a wide range of fields, including machine learning, drug discovery, biomarker discovery, genetics, proteomics, brain imaging analysis, financial modeling, and environmental sciences. The research project aims to develop state-of-the-art statistical tools that help researchers in these fields analyze their data.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0706724
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2007-06-01
Budget End: 2010-10-31
Support Year:
Fiscal Year: 2007
Total Cost: $101,953
Indirect Cost:
Name: Georgia Tech Research Corporation
Department:
Type:
DUNS #:
City: Atlanta
State: GA
Country: United States
Zip Code: 30332