This project studies penalized methods for variable selection and estimation in high-dimensional models. A general approach to fitting high-dimensional models is to use regularization penalties. Several important penalized methods for variable selection and estimation have been proposed, but their properties have not been systematically studied. To apply these methods in scientific investigations, it is important to understand their properties, in particular the conditions under which they correctly select the important variables and estimate their effects efficiently. Standard methods for evaluating a statistical procedure assume that the number of variables in a model is fixed and much smaller than the sample size. This formulation does not apply to high-dimensional models, whose analysis presents novel and challenging theoretical questions in mathematical statistics. Current penalized variable selection methods assume a known form of the statistical model, which can misrepresent reality. It is therefore important to investigate what happens if a parametric model is misspecified or if no parametric assumptions are made about the model. In particular, it is important to know under what conditions penalized methods select variables correctly despite misspecification, and under what conditions misspecification causes them to yield misleading results. It is also important to extend the penalized methods to nonparametric and semiparametric models.

High-dimensional data arise in many important applications, notably biological and biomedical investigations. With rapid advances in biotechnology, more and more large data sets are being generated. Identifying statistically and biologically significant patterns in high-dimensional, noisy data sets is a major challenge. The investigators apply the proposed research to genome-wide association (GWA) analysis, detection of copy number variation (CNV), and analysis of censored survival data with gene expression profiles. GWA analysis and detection of CNV enable the identification of genes and pathways responsible for the development and progression of diseases such as many forms of cancer. Correlating gene expression profiles with survival is useful because survival is perhaps the most important clinical endpoint in many cancer studies. The development of statistical methods that can handle high-dimensional problems in estimating the relationship between clinical outcomes and genetic and genomic data contributes to a better understanding of the genetic basis of diseases, better diagnoses, and better survival prediction.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0706348
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2007-08-15
Budget End: 2008-07-31
Support Year:
Fiscal Year: 2007
Total Cost: $45,000
Indirect Cost:
Name: Northwestern University at Chicago
Department:
Type:
DUNS #:
City: Evanston
State: IL
Country: United States
Zip Code: 60201