This research proposal consists of four closely related research thrusts, all centered on the common goal of an integrated treatment of statistical and computational issues in dealing with high-dimensional data sets arising in information technology (IT). The first two research thrusts focus on fundamental issues that arise in the design of penalty-based and other algorithmic methods for regularization. Key open problems to be addressed include the link between regularization methods and sparsity, consistency and other theoretical issues, and structured regularization methods for model selection. Sparse models are desirable both for scientific reasons, including interpretability, and for computational reasons, such as the efficiency of performing classification or regression. The third research thrust focuses on problems of statistical inference in decentralized settings, which are of increasing importance for a broad variety of IT applications such as wireless sensor networks, computer server "farms", and traffic monitoring systems. Designing suitable data compression schemes is the key challenge: on one hand, these schemes should respect the decentralization requirements imposed by the system (e.g., due to limited power or bandwidth for communicating data); on the other hand, they should be (near-)optimal with respect to a statistical criterion of merit (e.g., Bayes error for a classification task, or MSE for a regression or smoothing problem). The fourth thrust addresses statistical issues centered on the use of Markov random fields, which are widely used for modeling large collections of interacting random variables, and on the associated variational methods for approximating moments and likelihoods in such models.
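As a minimal illustration of how a penalty can induce the sparsity discussed in the first two thrusts, the sketch below fits an l1-penalized (lasso) regression to synthetic data in which only a handful of features are truly relevant. It is not part of the proposal: the data, problem sizes, and penalty level alpha=0.1 are arbitrary illustrative choices, and the example assumes NumPy and scikit-learn are available.

    # Minimal sketch: l1 (lasso) penalized regression recovering a sparse model.
    # Illustrative only; data and penalty level are arbitrary assumptions.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, k = 100, 500, 5            # n samples, p features, only k truly relevant
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:k] = 3.0 * rng.standard_normal(k)
    y = X @ beta_true + 0.1 * rng.standard_normal(n)

    # The l1 penalty drives most estimated coefficients exactly to zero,
    # yielding the kind of sparse, interpretable model described above.
    model = Lasso(alpha=0.1).fit(X, y)
    print("nonzero coefficients:", np.count_nonzero(model.coef_), "of", p)

In this toy setting the fitted model typically retains only a small number of nonzero coefficients, which is the computational and interpretability benefit of sparsity that the abstract refers to.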

BROAD SUMMARY: The field of statistical machine learning is motivated by a broad range of problems in the information sciences, among them remote sensing, data mining and compression, and statistical signal processing. Its applications range from homeland security (e.g., detecting anomalous patterns in large data sets) to environmental monitoring and assessment (e.g., estimating changes in Arctic ice). A challenging aspect of such applications is that data sets tend to be complex, massive (frequently measured in terabytes), and rich in possible features (hundreds of thousands to millions). These characteristics present fundamental challenges in the design and application of statistical models and algorithms for testing hypotheses and performing estimation. Whereas classical statistical methods are designed separately from computational considerations, dealing effectively with extremely high-dimensional data sets requires that computational issues be addressed in a more integrated manner during the design and testing of statistical models; it also makes issues of over-fitting and regularization, always statistically relevant, of paramount importance.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0605165
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2006-08-15
Budget End: 2010-07-31
Support Year:
Fiscal Year: 2006
Total Cost: $450,000
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94704