This project proposes a class of novel bi-level variable selection methods for high-dimensional regression models in which the covariates have a group structure. Existing variable selection methods are designed for either individual variable selection or group selection, but not for both. Furthermore, standard methods for evaluating a statistical procedure assume that the number of variables in a model is fixed and much smaller than the sample size, an assumption that generally does not hold in high-dimensional settings. Analysis of high-dimensional data presents novel and challenging theoretical and computational questions in statistics. The proposed methods are capable of simultaneously selecting groups and individual variables within the selected groups. For the proposed bi-level selection methods, computational algorithms will be developed and their theoretical properties in a class of important regression models will be investigated. The proposed methods are expected to correctly select the important groups and variables simultaneously with high probability in sparse models, even when the number of covariates is much larger than the sample size.

High-dimensional data arise in many scientific fields, including biology, economics, finance, information technology, and health sciences. In all these fields, identifying important features from data is a crucial step in the process of scientific discovery. The intended applications of the proposed study are in the analysis of high-dimensional genomic data. In particular, the proposed research will develop novel methods for genome-wide association studies and genetic pathway regression analysis. These are two of the most important approaches for understanding how genes and genetic pathways cause common and complex diseases such as various types of cancer. The proposed research aims to translate novel statistical approaches into new methodologies for analyzing high-dimensional genomic data.

Project Report

This project studies the problem of variable selection in high-dimensional statistical models. High-dimensional data arise in many scientific fields, including biology, economics, finance, information technology, and health sciences. In all these fields, identifying important features from data is a crucial step in the process of scientific discovery. The methods developed in this project are broadly applicable to high-dimensional feature selection problems. In particular, they are applicable to the analysis of high-dimensional genomic data, where the goal is to identify genetic elements that cause common and complex diseases and thereby contribute to better disease diagnosis, prognosis prediction, and treatment selection. This project develops a class of novel bi-level variable selection methods for high-dimensional regression models in which the variables have a group structure. Existing variable selection methods are designed for either individual variable selection or group selection, but not for both. The proposed methods are capable of selecting important groups as well as important individual variables within those groups. This property is particularly useful in the analysis of genomic data for finding genes that are related to disease, where it is important to take into account the grouping structure of genes in terms of biological pathways or functional groups. In this type of analysis, it is important to identify individual genes as well as biological pathways or functional groups. Novel theoretical properties of the methods developed under this project are studied, and efficient computational algorithms are developed to implement the methods. In addition, this project provided training to five graduate students. Three of them, including a female student, have obtained their Ph.D. degrees (one in statistics, one in biostatistics, and one in applied mathematics), and the other two are working towards their Ph.D. degrees in statistics and biostatistics on topics generated during the course of this project.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0805670
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2008-08-01
Budget End: 2011-07-31
Support Year:
Fiscal Year: 2008
Total Cost: $133,597
Indirect Cost:
Name: University of Iowa
Department:
Type:
DUNS #:
City: Iowa City
State: IA
Country: United States
Zip Code: 52242