This research program focuses on developing data analysis methods for high-dimensional problems. The associated theoretical questions concern the eigenvalues of large-dimensional random matrices. Three related directions are of particular interest: 1) furthering our understanding of the spectral properties of the relevant random matrices; 2) making practical use of the results obtained, in combination with more classical results from random matrix theory; and 3) finding, and contributing to, areas of application where this framework is relevant. Statisticians now often face "n times p" data matrices X for which p is of the same order of magnitude as n, and both are large. The sample covariance matrix computed from such data is important in many applications, as it underlies widely used methods such as principal components analysis. However, the theoretical results that underlie these methods fail to apply in the "large n, large p" setting just described, so a thorough study of sample covariance matrices in this setting is needed. The eigenvalues of these matrices are of particular interest and, from the point of view of applications, the largest and smallest ones especially so. The aim of the study is to obtain central limit-type theorems for these extreme eigenvalues and to use them in statistics, for instance in hypothesis testing and in assessing the power of the resulting tests. A more applied part of this work concerns the efficient use of results from random matrix theory, both new and old, to better estimate the eigenvalues of the population covariance matrix, with the ultimate aim of better estimating the whole covariance matrix when p and n are both large.
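The "large n, large p" effect described above can be illustrated with a small simulation. The sketch below is not part of the project itself: it assumes Gaussian data with identity population covariance (so every population eigenvalue equals 1) and uses NumPy, and the function name and the choices of n, p and seed are made up for illustration. It compares the extreme sample eigenvalues with the edges of the Marchenko-Pastur law, a classical result from random matrix theory.

```python
import numpy as np


def sample_covariance_eigenvalues(n, p, seed=0):
    """Eigenvalues of the sample covariance of n i.i.d. N(0, I_p) observations.

    Illustrative helper (hypothetical name). The mean is treated as known
    and equal to zero, so the sample covariance is X'X / n.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # n times p data matrix
    S = X.T @ X / n                   # p times p sample covariance matrix
    return np.linalg.eigvalsh(S)      # eigenvalues in ascending order


n, p = 1000, 500                      # p is of the same order as n
eigs = sample_covariance_eigenvalues(n, p)

# Edges of the Marchenko-Pastur support for identity population covariance.
gamma = p / n
mp_lower = (1 - np.sqrt(gamma)) ** 2
mp_upper = (1 + np.sqrt(gamma)) ** 2

print("population eigenvalues: all equal to 1")
print(f"smallest sample eigenvalue: {eigs[0]:.3f} (Marchenko-Pastur edge {mp_lower:.3f})")
print(f"largest sample eigenvalue:  {eigs[-1]:.3f} (Marchenko-Pastur edge {mp_upper:.3f})")
```

Although every population eigenvalue is 1, the sample eigenvalues spread over an interval whose width depends on p/n, so the largest one overstates and the smallest one understates the truth. Classical fixed-p theory would instead predict that all sample eigenvalues are close to 1.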

Technological progress allows us to store and use massive amounts of data about many aspects of our daily lives. An interesting problem is to use this data to understand how certain traits depend on each other. In the stock market, for example, we might be interested in how the behavior of one stock affects the behavior of another; understanding all these interrelationships gives a measure of the risk taken by investing in portfolios built from the corresponding stocks. Statisticians have a number of tools for dealing with such interrelationships. Even if all the interrelationships are small or weak, so that no single trait "should" tell us much about any other, we can find ways of looking at the data that reveal combinations of traits carrying enormous amounts of information. We also know the typical values of these combinations, so we may be able to detect unusual features in the data by looking at it the right way. These statistical techniques are applied very widely across the sciences, in fields ranging from climatology to genetics and image recognition, and thousands of research papers that use them are published each year. However, the theory that underlies these techniques was created in an era when massive datasets did not exist, because they could not be stored. This research project focuses on theories, and their applications, that are better suited to today's massive datasets. The applications should allow us to see structure where the classical tools fail to see any, and to tell us that there is no structure when the classical tools say there is. There is also increasing evidence that our standard tools often give very inaccurate results for standard measures of risk or of the amount of information carried by combinations of traits: risks appear to be underestimated, and the amount of information overestimated. Part of this research program will be dedicated to measuring how inaccurate the classical results are for large datasets and how a more relevant theory can be used to correct these inaccuracies.
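The remark about underestimated risk can be made concrete with a toy portfolio. The following is a minimal sketch under strong assumptions that are not taken from the project: p hypothetical stocks whose returns are independent Gaussian with unit variance, and a minimum-variance portfolio whose weights are computed from the sample covariance matrix. The function name and the sizes n and p are illustrative only.

```python
import numpy as np


def min_variance_risks(n, p, seed=0):
    """Estimated and true variance of a sample-based minimum-variance portfolio.

    Toy setting (an assumption for illustration): n observations of p
    independent assets with unit variance, so the true covariance is I_p.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    S = X.T @ X / n                        # sample covariance, mean known to be 0

    # Minimum-variance weights under the constraint that they sum to 1.
    w = np.linalg.solve(S, np.ones(p))
    w /= w.sum()

    estimated_risk = w @ S @ w             # what the sample covariance reports
    true_risk = w @ w                      # w' I_p w, the actual variance
    best_possible = 1.0 / p                # variance of the equal-weight portfolio
    return estimated_risk, true_risk, best_possible


estimated, true, best = min_variance_risks(n=400, p=200)
print(f"risk estimated from the data: {estimated:.5f}")
print(f"true risk of that portfolio:  {true:.5f}")
print(f"best achievable true risk:    {best:.5f}")
```

In this setting the risk reported by the sample covariance matrix is lower than the best risk any portfolio could truly achieve, while the true risk of the chosen portfolio is higher than that optimum. The gap widens as p approaches n.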

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0605169
Program Officer: Gabor J. Szekely
Budget Start: 2006-07-01
Budget End: 2010-06-30
Fiscal Year: 2006
Total Cost: $240,000
Name: University of California Berkeley
City: Berkeley
State: CA
Country: United States
Zip Code: 94704