Recent advances in science and technology have led to the generation of massive amounts of large-scale data with complex structures, including genomics, neuroimaging, and microbiology data. These large-scale data sets pose significant statistical and computational challenges for data analysis. First, widely used statistical methods yield unstable estimates and do not scale computationally to large-scale data sets. Second, complex data sets often contain outliers, possibly due to measurement error or heavy-tailed random noise. For instance, in genomic studies, it has been observed that the distribution of gene expression levels is generally heavy-tailed, that is, the data contain many extremely large values. Classical statistical methods will yield biased estimates and spurious scientific discoveries if these outliers are not taken into account during model estimation and inference. This project aims to develop scalable and robust multivariate statistical methods to address these problems.

In this project, the investigator uses a combination of regularization and statistical optimization techniques to develop novel multivariate statistical methods for analyzing complex high-dimensional data sets. The first part of the project concerns the sparse generalized eigenvalue problem, which arises naturally in many statistical models such as partial least squares, canonical correlation analysis, sufficient dimension reduction, and Fisher's discriminant analysis. The investigator will develop a general framework for solving the sparse generalized eigenvalue problem, making a wide range of statistical models available for analyzing high-dimensional data. Furthermore, the investigator will study the theoretical properties of the sparse generalized eigenvalue problem, which will lead to an understanding of various statistical models that were previously not well understood in the high-dimensional setting. The second part of the research project focuses on a class of robust sparse reduced rank regression models. The investigator will develop efficient algorithms and high-dimensional asymptotic analysis for the resulting estimators under the Huber loss function, and quantify the bias-robustness tradeoff between using the Huber loss and the squared error loss. This research project will also deliver easy-to-use software packages for fitting the developed methods.
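
The tradeoff between the Huber loss and the squared error loss mentioned above can be illustrated with a much simpler problem than reduced rank regression: estimating a single location parameter from data containing one gross outlier. The sketch below is only an illustration and is not the investigator's method. The function names `huber_loss` and `huber_location`, the threshold `delta = 1.345` (a conventional default), the iteratively reweighted averaging scheme, and the toy data are all assumptions introduced here.

```python
import numpy as np


def huber_loss(r, delta=1.345):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))


def huber_location(x, delta=1.345, n_iter=100, tol=1e-10):
    """Minimize sum_i huber_loss(x_i - mu) over mu by iteratively
    reweighted averaging (hypothetical helper, for illustration only)."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)  # robust starting point
    for _ in range(n_iter):
        a = np.abs(x - mu)
        # Weight 1 inside the quadratic zone, delta/|r| in the linear zone.
        w = np.where(a <= delta, 1.0, delta / np.maximum(a, 1e-12))
        mu_new = np.sum(w * x) / np.sum(w)
        converged = abs(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu


# Toy data: five well-behaved observations and one gross outlier.
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 100.0])
mean_estimate = x.mean()            # minimizer of the squared error loss
huber_estimate = huber_location(x)  # minimizer of the Huber loss
print(mean_estimate, huber_estimate)
```

On this toy data the squared error minimizer (the sample mean) is pulled to about 16.7 by the single outlier, while the Huber estimate stays near 0.27. The small remaining shift away from zero is the bias paid for robustness; a larger `delta` moves the Huber estimate toward the mean, and a smaller one moves it toward the median.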

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1949730
Program Officer: Gabor Szekely
Budget Start: 2019-08-01
Budget End: 2021-07-31
Fiscal Year: 2019
Total Cost: $75,028

Institution
Name: Regents of the University of Michigan - Ann Arbor
City: Ann Arbor
State: MI
Country: United States
Zip Code: 48109