This application proposes a class of novel constrained group selection methods for high-dimensional models in which there are natural constraints on the parameters. The proposed project is expected to stimulate new research directions for several important statistical modeling and analysis problems, including structure estimation and variable selection in semiparametric additive models, varying coefficient models, and survival analysis models in high-dimensional settings where the number of variables is larger than the sample size. The project will also yield new methods for integrative analysis of multiple genomic datasets and for genome-wide association studies. Theoretical properties of the proposed methods in high-dimensional settings will be investigated, and computational algorithms will be developed. Analysis of high-dimensional data presents new and challenging theoretical and computational questions in statistics. Standard methods, which assume that the number of variables is fixed and much smaller than the sample size, are not applicable to high-dimensional models. The proposed methods are expected to select the important groups and estimate model structures correctly with high probability in sparse, high-dimensional settings.

High-dimensional data arise in many diverse fields of the sciences and humanities, including biology, economics, finance, information technology, and health sciences. In all these fields, feature selection is a crucial step in the process of knowledge discovery from data. In genetic and genomic research, rapid advances in biotechnology are generating more and more large datasets. Identifying statistically and biologically significant patterns in high-dimensional, noisy datasets is becoming a major challenge. The development of statistical methods that can deal with high-dimensional problems in estimating the relationship between clinical outcomes and genetic data will contribute to a better understanding of the genetic basis of diseases, better diagnoses, and better survival prediction. The proposed methods will be applied to the analysis of high-dimensional censored survival data, longitudinal data, genome-wide association studies (GWAS), and integrative analysis of multiple genomic datasets. Censored and longitudinal data arise in many clinical and biomedical studies. GWAS and integrative analysis are important methods for identifying disease susceptibility genes for common and complex diseases. The ultimate goal of clinical and genetic research is to understand the relationships between risk factors and phenotypes in order to develop new approaches to the prevention, diagnosis, and treatment of disease. This project aims to translate novel statistical approaches into new methodologies for analyzing the high-dimensional clinical and genomic data that are important in achieving this goal. The methods and results from the proposed project will be incorporated into a graduate course on high-dimensional data analysis. The investigator will broadly disseminate the results to the scientific community by submitting papers to scientific journals and by making the papers and the computer programs publicly available on the internet. The investigator will also present the results at scientific conferences and workshops.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1208225
Program Officer: Gabor Szekely
Budget Start: 2012-07-01
Budget End: 2015-06-30
Fiscal Year: 2012
Total Cost: $159,681
Name: University of Iowa
City: Iowa City
State: IA
Country: United States
Zip Code: 52242