Estimation and prediction with large-scale data sets arise commonly in statistics and related fields and pose great challenges. To address these challenges, four interrelated research topics are proposed for investigation. First, the investigator proposes robust variable selection methods for heavy-tailed data in the ultra-high-dimensional setting, where the dimensionality can grow exponentially with the sample size. To handle heavy tails, regularization methods with robust losses and general penalty functions are investigated in various model settings. The risk properties of these methods are studied, and the optimality of the penalty and loss functions is characterized. Robust independence screening methods are also proposed and studied. Second, variable selection is investigated in high-dimensional functional regression models with functional predictors and/or a functional response. Model fitting procedures are proposed, and the sampling properties of the proposed methods are thoroughly investigated. Third, the investigator studies regularization parameter selection in penalized empirical risk minimization, in both correctly specified and misspecified models in ultra-high dimensions. The appropriate tradeoff between model fit and model complexity is characterized. This study also answers the question of whether conventional model selection criteria such as AIC and BIC continue to work in ultra-high dimensions. Fourth, high-dimensional classification with correlated features is studied extensively under the unified framework of thresholding classification rules, and the optimal choice of threshold that minimizes the classification error is identified. The investigator studies Gaussian classification and generalizes the methods and results to the case of correlated discrete features.
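To make the fourth topic concrete, the following is a minimal illustrative sketch, not the proposal's actual method: a thresholding classification rule for two Gaussian classes with identity covariance, where a sample x is assigned to class 1 when the projection w·x exceeds a threshold t. The dimension, class means, and sample sizes below are assumptions chosen for illustration; with equal priors, the error-minimizing threshold is the projected midpoint of the two class means.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 20                               # feature dimension (assumed for illustration)
mu0 = np.zeros(p)                    # class 0 mean
mu1 = np.full(p, 0.5)                # class 1 mean

n = 500
X0 = rng.normal(size=(n, p)) + mu0   # class 0 samples ~ N(mu0, I)
X1 = rng.normal(size=(n, p)) + mu1   # class 1 samples ~ N(mu1, I)

w = mu1 - mu0                        # discriminant direction (identity covariance)
t_opt = w @ (mu0 + mu1) / 2          # midpoint threshold, optimal for equal priors

# Empirical misclassification rate of the rule 1{w.x > t_opt}
err = ((X0 @ w > t_opt).mean() + (X1 @ w <= t_opt).mean()) / 2
print(round(err, 3))
```

In the proposal's setting the interest is in how such a threshold should be chosen when features are correlated and the dimension is large; the sketch only shows the uncorrelated baseline against which those generalizations are measured.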

Thanks to the advent of modern technologies such as handwritten digit recognition and single-nucleotide polymorphism (SNP) genotyping experiments, massive data sets with a large number of variables are becoming increasingly common in scientific fields such as computational biology, economics, finance, machine learning, and climatology. Effectively analyzing these data sets poses methodological and computational challenges not present in smaller-scale studies. A major goal of this proposal is to develop new or extended methodologies and investigate their sampling properties in depth and breadth for high-dimensional model building and model evaluation in various regression and classification settings. The PI has broad research interests in many fields outside statistics, such as computational biology, finance, econometrics, and machine learning. The proposed methods will be tested on real data sets and extended to these areas. In addition, the PI plans to develop software packages implementing the proposed methods and make them publicly available. The proposed work will benefit a broad range of scientists and researchers in various fields. The PI also plans to integrate education activities with the proposed research, such as involving minority, undergraduate, and graduate students in the proposed projects and incorporating cutting-edge high-dimensional statistical methods into new courses.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1150318
Program Officer: Gabor J. Szekely
Budget Start: 2012-08-01
Budget End: 2017-07-31
Fiscal Year: 2011
Total Cost: $400,000
Name: University of Southern California
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089