Large datasets are becoming increasingly available and important in science and technology. This project will develop new tools for dealing with large datasets, as well as a new understanding of some of the fascinating phenomena that emerge with high-dimensional data. Specifically, the project will develop methods for recovering signals (vectors and matrices) from highly undersampled measurements (problems known as compressed sensing and matrix completion), methods for recovering low-rank matrices from noisy and undersampled measurements, and tools for robustly fitting predictive models when the number of predictor variables is very large. All of these tools have broad domains of applicability: essentially wherever big data are gathered, researchers will want to use them. The phenomena to be explored include the phase transitions that some algorithms undergo, passing abruptly from successful recovery to failure as the amount of undersampling and/or contamination of the data increases, and the fact that fundamental formulas of classical statistics, such as the Fisher information formula for the variance of the maximum-likelihood estimator, no longer apply in high dimensions. We expect to develop quantitatively precise explanations of these phenomena, which will help engineers and scientists plan experiments and draw reliable inferences from large datasets.
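To make the phase-transition phenomenon concrete, the following sketch runs a small compressed sensing experiment in Python. It is an illustration only, not the project's actual experimental code: the problem size, trial counts, and success tolerance are arbitrary choices. Sparse signals are measured with an iid Gaussian matrix and reconstructed by basis pursuit (l1 minimization, cast here as a linear program and solved with SciPy's linprog); the empirical success rate changes abruptly as the sparsity grows relative to the number of measurements.

    # Empirical phase transition of basis pursuit (illustrative sketch).
    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    p = 100  # ambient dimension of the signal

    def basis_pursuit(A, y):
        """Solve min ||x||_1 subject to Ax = y via the split x = x_pos - x_neg."""
        m, n = A.shape
        c = np.ones(2 * n)                  # objective: sum(x_pos) + sum(x_neg)
        A_eq = np.hstack([A, -A])           # A (x_pos - x_neg) = y
        res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
        return res.x[:n] - res.x[n:]

    for m in (25, 50, 75):                  # number of measurements
        for k in (5, 15, 30):               # sparsity of the true signal
            successes = 0
            for _ in range(20):
                x0 = np.zeros(p)
                support = rng.choice(p, size=k, replace=False)
                x0[support] = rng.standard_normal(k)
                A = rng.standard_normal((m, p)) / np.sqrt(m)  # iid Gaussian design
                x_hat = basis_pursuit(A, A @ x0)
                successes += np.linalg.norm(x_hat - x0) < 1e-4 * np.linalg.norm(x0)
            print(f"m={m:2d}  k={k:2d}  success rate {successes / 20:.2f}")

Sweeping a finer grid of (m, k) pairs and plotting the success fraction traces out the sharp boundary between the regime of reliable recovery and the regime of reliable failure.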

Several classical problems in multivariate data analysis take on a new character when the number of variables p and the number of observations n are both large. These problems include estimation in the linear model, robust estimation in the linear model, sparsity-penalized estimation in the linear model, and estimation of covariance matrices obeying a factor model. Work in a range of fields shows that if certain underlying features of the problem are random (such as iid Gaussian predictor variables or uniformly distributed eigenvectors), then various surprises occur in the limit where p and n tend to infinity in fixed proportion. These surprises include sharp phase transitions between success and failure in recovering the object of interest, and extra Gaussian noise beyond that caused by the measurements; neither has a counterpart in the classical regime where p stays fixed while n tends to infinity. In the proposed work, we will use both massive computational experiments and precise theoretical calculations to predict and verify such surprising phenomena in the large-n, large-p setting. On the computational side, our techniques involve new tools for designing and executing experiments on many processors in cloud-based configurations; on the theoretical side, they involve analysis tools for understanding phase transitions in compressed sensing, covering both exact reconstruction from noiseless measurements and the asymptotic mean-squared error of convex and non-convex optimization reconstructions. In the study of low-rank models for matrices, we will develop and adapt recent advances in random matrix theory, such as progress on the so-called spiked covariance model.
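As a concrete illustration of the spiked covariance phenomenon mentioned above (a sketch under assumed parameters, not code from the proposal): for a population covariance I + beta vv' with aspect ratio gamma = p/n, random matrix theory predicts that the top sample eigenvalue sticks to the Marchenko-Pastur bulk edge (1 + sqrt(gamma))^2 when beta <= sqrt(gamma), and separates to approximately (1 + beta)(1 + gamma/beta) when beta > sqrt(gamma), the so-called BBP phase transition. The dimensions and spike strengths below are arbitrary illustrative choices.

    # Top sample eigenvalue in the spiked covariance model (illustrative sketch).
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 2000, 1000
    gamma = p / n                           # aspect ratio; sqrt(gamma) ~ 0.707

    for beta in (0.3, 1.5, 3.0):            # spike strengths below and above sqrt(gamma)
        # Draw n samples with covariance I + beta * e1 e1', i.e. a single spike
        # of strength beta along the first coordinate axis.
        X = rng.standard_normal((n, p))
        X[:, 0] *= np.sqrt(1.0 + beta)
        S = X.T @ X / n                     # sample covariance matrix
        top = np.linalg.eigvalsh(S)[-1]
        if beta > np.sqrt(gamma):
            pred = (1 + beta) * (1 + gamma / beta)   # detached spike location
        else:
            pred = (1 + np.sqrt(gamma)) ** 2         # Marchenko-Pastur bulk edge
        print(f"beta={beta:.1f}  top eigenvalue={top:.3f}  prediction={pred:.3f}")

Below the threshold the leading sample eigenvalue carries essentially no information about the spike; above it, the eigenvalue becomes informative, which is exactly the kind of abrupt transition the proposed theory aims to predict precisely.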

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant
Application #: 1418362
Program Officer: Christopher Stark
Budget Start: 2014-08-15
Budget End: 2018-07-31
Fiscal Year: 2014
Total Cost: $700,594
Name: Stanford University
City: Stanford
State: CA
Country: United States
Zip Code: 94305