Major challenges in modern data-rich environments require new statistical algorithms that succeed under realistic scenarios and model assumptions, such as estimation in the distributed setting and the ability to handle heavy-tailed data, outliers, and missing observations. The research performed by the Principal Investigator (PI) in the course of this project focuses on two important problems faced by contemporary statistical science: scalability and robustness. The goal of the project is to advance our understanding of statistical techniques that involve (a) high-dimensional covariance matrix estimation and (b) distributed statistical estimation protocols. The results will be of interest to scientists working on theory as well as applications.

One part of this project aims to answer open questions related to high-dimensional covariance matrix estimation for heavy-tailed distributions. Such distributions serve as a viable model for data corrupted with outliers, an almost inevitable scenario in applications. The covariance matrix is one of the most fundamental objects in high-dimensional data analysis: many important statistical tools, such as Principal Component Analysis (PCA) and regression analysis, involve covariance estimation as a crucial step. For instance, PCA has striking connections to nonlinear dimension reduction, manifold learning techniques, genetics, and computational biology, among many other areas. However, the assumptions underlying the theoretical analysis of most existing estimators, such as various modifications of the sample covariance matrix, are often restrictive and do not hold in real-world scenarios. Using tools from random matrix theory, the PI will develop a new class of robust estimators that are numerically tractable, show good practical performance, and enjoy strong theoretical guarantees under much weaker conditions than currently available alternatives. Specifically, the goal is to design estimators that concentrate tightly around the unknown "true" covariance matrix under weak assumptions on the underlying distribution, such as the existence of moments of only low order.

Another part of this project is devoted to novel algorithms for scalable estimation that take advantage of the "divide and conquer" approach. The divide and conquer paradigm assumes that data is stored and analyzed in a distributed way by a cluster consisting of several machines: each machine in the cluster works on its own sub-sample, communication among machines is limited, and the final result is obtained by piecing the outcomes of these distributed computations together. The PI will develop a class of new divide and conquer strategies supported by strong theoretical evidence. The project will also investigate connections between distributed estimation strategies and the robustness of the resulting algorithms, an important characteristic of large distributed systems.
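
As an illustration only, the divide and conquer scheme described above can be sketched for the simplest possible target, the mean of a one-dimensional sample. The sketch is not the PI's method: it uses the classical median-of-means idea, and the helper names (`split_into_blocks`, `local_estimate`, `combine_by_median`, `combine_by_average`), the round-robin assignment of observations to 15 simulated machines, and the three planted outliers are all assumptions made for the example.

```python
import random
import statistics


def split_into_blocks(data, num_machines):
    """Assign observations to machines round-robin (hypothetical storage layout)."""
    return [data[i::num_machines] for i in range(num_machines)]


def local_estimate(block):
    """Work done independently on one machine: the sample mean of its block."""
    return sum(block) / len(block)


def combine_by_median(local_estimates):
    """Robust aggregation: the median of the local means (median-of-means)."""
    return statistics.median(local_estimates)


def combine_by_average(local_estimates):
    """Non-robust aggregation: the plain average of the local means."""
    return sum(local_estimates) / len(local_estimates)


# Simulated sample with true mean 0 (assumption: standard normal data).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(9000)]

# Plant three gross outliers; round-robin places them on three different machines.
for i in range(3):
    data[i] = 1e6

blocks = split_into_blocks(data, num_machines=15)

# Each machine sends back a single number, so communication stays minimal.
local_means = [local_estimate(block) for block in blocks]

mom_estimate = combine_by_median(local_means)
naive_estimate = combine_by_average(local_means)

print("median of local means: ", mom_estimate)
print("average of local means:", naive_estimate)
```

In this sketch the median aggregator stays close to the true mean as long as fewer than half of the blocks contain corrupted observations, whereas the plain average is pulled far away by the three outliers. This is one simple way in which a distributed estimation strategy and robustness can be linked.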

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1712956
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2017-09-01
Budget End: 2020-08-31
Support Year:
Fiscal Year: 2017
Total Cost: $99,985
Indirect Cost:
Name: University of Southern California
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089