This proposal develops and studies sparse variants of classic multivariate data analysis methods. It primarily focuses on sparse principal component analysis (PCA) and the related sparse canonical correlation analysis (CCA), but also intends to explore sparse variants of methods such as correspondence analysis and discriminant analysis. The motivation for developing sparse multivariate analysis algorithms is their potential for yielding statistical results that are more interpretable and more robust than classical analyses, while giving up as little as possible in the way of statistical efficiency and expressive power. The investigators have derived a convex relaxation for sparse PCA as a large-scale semidefinite program. The proposed research first studies the theoretical and practical performance of this relaxation as well as the computational complexity involved in solving large-scale instances of the corresponding semidefinite programs. In a next step, it focuses on extending these results to the other multivariate data analysis methods cited above.

Principal Component Analysis (or PCA) is a classic statistical tool used to study experimental data with a very large number of variables (meteorological records, gene expression coefficients, the interest rate curve, social networks, etc). It is primarily used as a dimensionality reduction tool: PCA produces a reduced set of synthetic variables that captures a maximum amount of information on the data. This makes it possible to represent data sets with thousands of variables on a three dimensional graph while still capturing most of the features of the original data, thus making visualization and interpretation easier. Unfortunately, the key shortcoming of PCA is that these new synthetic variables are a weighted sum of all the original variables making their physical interpretation difficult. The proposed research will study algorithms for computing sparse PCA, i.e., computing new synthetic variables that are the weighted sum of only a few problem variables while keeping most of the features of the original data set. Sparse PCA is a hard combinatorial problem but the investigators have produced a relaxation that can be solved efficiently using recent results in convex optimization. The investigators plan to study the theoretical and practical performance of this relaxation and extend these results to other statistical methods.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
0625371
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2006-09-15
Budget End
2009-08-31
Support Year
Fiscal Year
2006
Total Cost
$230,000
Indirect Cost
Name
University of California Berkeley
Department
Type
DUNS #
City
Berkeley
State
CA
Country
United States
Zip Code
94704