This proposal addresses questions pertinent to identifying sparse subsets of high-dimensional covariate spaces, in the context of regularization methods, when simple least-squares loss functions are not well suited. The broad goal is to understand the fundamental interactions among the nonlinear and/or censored structure of the statistical model, the regularization scheme, and the intrinsic dimensionality of the problem. Specifically, the PI aims at (1) identifying new statistical problems and regularization schemes, with complex nonlinear and time-to-event structure and high-dimensional covariate spaces, that are tuned to the characteristics of the data; (2) developing novel non-asymptotic oracle bounds on the behavior of regularized estimators, using techniques of random matrix theory (especially high-probability bounds on various matrix norms) and approximation theory to better understand the effects of dimensionality on non-asymptotic properties; (3) investigating new non-asymptotic bounds on the risk of semiparametric methods when the statistical model is possibly misspecified; (4) analyzing and developing models that exploit the interplay among censoring rate, sample size, and dimensionality of the problem; and, importantly, (5) introducing new algorithms that optimally and efficiently solve the large-scale problems investigated.

The explosion of microarray technologies has led to a vast number of large-scale genome-wide association studies in which simultaneous analysis of a large number of SNPs is pertinent to identifying the genetic basis of complex diseases. The presence and importance of a time-to-event component call for significant advances in statistical methodology for both NP dimensionality and censored structure. This research proposal aims at developing innovative and effective statistical methods for such complex data, with special impact in genetics, public health, and bioinformatics, where censoring and the vast number of gene interactions make identification of misbehaving genes very difficult. Moreover, the developments of this interdisciplinary project will enhance new scientific discoveries, build new collaborative connections with practitioners, and promote the teaching and training of graduate students in contemporary state-of-the-art machine learning techniques applied to semiparametric models and censored data. To promote the progress of science, the PI will establish explicit collaborations among the Department of Mathematics, the Biostatistics division of the Medical School at UCSD, and the Supercomputer Center in San Diego. Through dissemination of the results of this proposal, the PI plans to expose mathematics majors to biology and to promote science among underrepresented groups and women in mathematics.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1205296
Program Officer: Gabor Szekely
Budget Start: 2012-09-01
Budget End: 2015-08-31
Fiscal Year: 2012
Total Cost: $120,000
Name: University of California San Diego
City: La Jolla
State: CA
Country: United States
Zip Code: 92093