A research effort is proposed to create tools for data analysis and inference in high-dimensional settings. The effort uses tools from random matrix theory (RMT), Banach space theory (BST), and differential geometry (DG) to expose new phenomena in high-dimensional statistical inference and data analysis, yielding practical statistical methods with rigorously established properties under carefully stated conditions. The results will impact a wide range of data analysis problems, including the building of linear models, the testing of complex hypotheses about multivariate data, and the detection of subtle nonlinear structures in high-dimensional data. In the research, the investigators build further bridges between RMT, BST, and DG and three problem areas: (a) Sparse Linear Modelling: how should one build a predictive model that uses relatively few of the many available predictors? (b) Multivariate Analysis in High Dimensions: how should one best estimate and test for structure in high-dimensional data, particularly when the number of variables is large and the number of observations is small? (c) Manifold Learning: how can one best find nonlinear structure in high-dimensional data and best parametrize that structure? Each of these areas is of fundamental importance to the analysis of high-dimensional data, and the investigators identify a strategy to use RMT, BST, and DG to make substantial contributions to each.
This strategy builds on the investigators' recent research accomplishments using RMT, BST, and DG, which will be extended to show: (a) how to find the best-fitting low-dimensional linear model without spending exponential time searching through model space, extending previous successes in using Basis Pursuit, LARS, and Lasso; (b) how to correctly test a wide range of important hypotheses in multivariate analysis using the Tracy-Widom distribution, extending previous results in applying the Tracy-Widom distribution to Principal Components Analysis; and (c) how to correctly estimate a nonlinear parametrization of sparsely sampled curved data in high-dimensional space, extending previous successes in developing the Hessian Eigenmap technique of dimensionality reduction.

The motivation for this project lies in the "data deluge" now engulfing every branch of science and technology. In field after field, new sensors are creating data streams of unparalleled breadth and depth. As a result, scientific and technological progress today depends heavily on the ability to process high-dimensional data and reduce its dimensionality, sometimes drastically, obtaining a good approximation using a few well-chosen combinations of the original measurements. While many methods of dimensionality reduction have already been proposed, much existing research activity in this area is heuristic and speculative; the tools are often of unknown reliability, and their properties hold under conditions of unknown generality. This project uses careful mathematical analysis to develop methods of dimensionality reduction that are rigorously correct and/or optimal. These methods give the user the assurance that important features are captured in the dimensions that remain and that little of importance is discarded in the dimensions that are thrown away. The project develops such rigorous methods in three areas: (a) building parsimonious but accurate predictive models out of a database of many possible predictors; (b) testing for hidden structure in what otherwise seems to be high-dimensional "noise"; (c) discovering the correct representation for data that are intrinsically nonlinear. Strong expectations for the success of this project can be based on existing solid achievements by the investigators in each of these three areas.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0505303
Program Officer: Gabor J. Szekely
Budget Start: 2005-07-01
Budget End: 2010-06-30
Fiscal Year: 2005
Total Cost: $799,890
Name: Stanford University
City: Palo Alto
State: CA
Country: United States
Zip Code: 94304