The central goals of this proposal are: (a) to provide sharp finite-sample bounds, in various matrix norms, on the accuracy of the sample covariance estimator for high dimensional covariance matrices of reduced effective rank; (b) to extend these results to functional data and characterize classes of covariance operators of reduced effective rank, to use them to develop fully data-driven methods, with strong theoretical justification, for eigenvalue and eigenvector selection in finite samples, and to apply them to modeling vehicle exhaust emissions; (c) to study factor models of high dimensional correlation matrices of elliptical copulas, to obtain minimax estimators of these matrices, and to use these results in classification problems for breast cancer data. There are interesting connections between the proposed research and existing results on the estimation of covariance or correlation matrices under sparsity constraints. However, estimation under the existing sparsity types (entry-wise, row-wise, off-diagonal decay) cannot be used for modeling general types of dependency. The proposed work bridges this gap and poses different mathematical and computational challenges.
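
As a rough illustration of goal (a), the sketch below computes an effective rank and the operator-norm error of the sample covariance estimator on simulated data. The abstract does not state its definition of effective rank; the code assumes one common choice from the literature, trace divided by operator norm, and the function names, the polynomially decaying spectrum, and the sample sizes are all hypothetical choices made for illustration, not the proposal's method.

```python
import numpy as np


def effective_rank(sigma):
    """Assumed definition: tr(Sigma) / ||Sigma||_op for a symmetric PSD matrix."""
    eigvals = np.linalg.eigvalsh(sigma)
    return float(eigvals.sum() / eigvals.max())


def sample_covariance_error(sigma, n, rng):
    """Operator-norm error of the sample covariance from n centered Gaussian draws."""
    p = sigma.shape[0]
    x = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    # The mean is known to be zero here, so no centering step is applied.
    sigma_hat = x.T @ x / n
    return float(np.linalg.norm(sigma_hat - sigma, ord=2))


rng = np.random.default_rng(0)
p = 200
# Hypothetical spectrum: fast decay keeps the effective rank far below p.
spectrum = 1.0 / np.arange(1, p + 1) ** 2
sigma = np.diag(spectrum)

r_eff = effective_rank(sigma)
errors = {n: sample_covariance_error(sigma, n, rng) for n in (50, 200, 800)}

print(f"dimension p = {p}, effective rank = {r_eff:.2f}")
for n, err in errors.items():
    print(f"n = {n:4d}  operator-norm error = {err:.4f}")
```

In this setup the dimension is 200 while the effective rank stays below 2, which is the kind of regime where the estimation error would be expected to depend on the effective rank rather than on the ambient dimension.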

Modeling high dimensional data and evaluating their variability present increasing challenges in many scientific disciplines. For instance, such challenges occur in modeling network data in genetics and molecular biology; high dimensional portfolios in economics; and samples of curves in psychology, public health, transportation and urban planning. Substantially better solutions can be provided whenever the data are generated by a model with low dimensional structure. For the statistical problem of high dimensional covariance and correlation matrix estimation, this proposal will formulate the relevant notion of low dimensional structure (for instance, low effective rank or approximate low dimensional factor models). The need for a systematic investigation of various classes of covariance matrices in high dimensional models, especially in functional data settings, has only begun to be recognized in recent years. This proposal is therefore a timely addition to the currently limited battery of methods and theoretical results in this important area. The usefulness of these techniques will be demonstrated by applications to data from genomics, proteomics and environmental engineering. Free software that implements the developed methodology will be made available on the web in a readily usable form.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1310119
Program Officer: Gabor Szekely
Budget Start: 2013-07-01
Budget End: 2016-06-30
Fiscal Year: 2013
Total Cost: $200,000
Name: Cornell University
City: Ithaca
State: NY
Country: United States
Zip Code: 14850