In most areas of human knowledge, the information revolution has resulted in a massive data explosion. The scale of data sets, their distributed and heterogeneous nature, and the need to quickly deliver results that non-experts can easily interpret raise new theoretical and computational challenges for statistical learning, from variable selection and structural inference to visualization and online learning. This project uses sparse statistical inference as a powerful approach to meet these challenges. Its key insight is that seeking sparsity is a meaningful way of simultaneously stabilizing inference procedures and highlighting structure in the underlying data. This work thus combines fundamental advances in sparse statistical learning with cutting-edge computational tools from mathematical programming to create a new framework for structural knowledge discovery in large-scale, streaming data sets. It focuses on two fundamental themes in sparse inference: variable selection and structural inference. Variable selection seeks to isolate a few key variables from high-dimensional data sets and is a fundamental preprocessing tool in statistical learning. Structural inference then aims to consistently identify a few core dependence relationships among these variables to highlight their structure. From a computational point of view, many recent results in machine learning have relied on advanced methods from convex optimization, such as semidefinite programming and robust optimization, and this project seeks to improve the complexity of these algorithms and their capacity to handle very large-scale, streaming data. In practice, this project is motivated by the desire to help the public understand our democracies by analyzing large-scale political and social data sets, with a particular focus on voting records, online news sources, and polling data.
Its approach is to apply statistical inference principles to the social sciences, using collaborations with experts in political science and economics to forge the models and techniques under study. In carrying out the research, this project will train graduate students from statistics, electrical engineering, and financial engineering to become interdisciplinary researchers at the interface of statistics, optimization, and subject matter areas such as finance and political science. In addition, the PIs plan to develop a web site, accessible first to a restricted set of social science researchers, that allows them to analyze mid-sized corpora of online news in text format, for example as sparse graphs of words showing statistical associations between given keywords. The PIs also plan to develop a software toolbox implementing these results, interfaced with common numerical packages such as MATLAB, R, or Python, as well as an undergraduate course on "Statistical Analysis of Online Data" at Berkeley and Princeton, incorporating some of the material produced in this project into the course program.
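
As a rough illustration of the variable selection theme described above, the sketch below fits an L1-penalized least-squares model (the lasso) by cyclic coordinate descent. The abstract does not name a specific estimator, so the lasso is an assumed stand-in for "seeking sparsity"; the function names, the penalty value `lam`, and the synthetic data are all hypothetical and not taken from the project.

```python
import random


def soft_threshold(z, t):
    """Shrink z toward zero by t; anything in [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1.

    X is a list of n rows with p entries each, y a list of n responses.
    This is an illustrative sketch, not the project's algorithm.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)
            ) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_wj
    return w


# Hypothetical synthetic data: 10 candidate variables, only 2 of which
# (indices 0 and 3) actually influence the response.
rng = random.Random(0)
n, p = 200, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
true_w = [3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = [
    sum(X[i][j] * true_w[j] for j in range(p)) + 0.1 * rng.gauss(0, 1)
    for i in range(n)
]

w_hat = lasso_coordinate_descent(X, y, lam=0.5)
selected = [j for j, wj in enumerate(w_hat) if wj != 0.0]
print("selected variables:", selected)
```

The soft-thresholding step is what produces exact zeros, so the fitted coefficient vector doubles as a list of selected variables. The same penalty also shrinks the retained coefficients toward zero, which is the stabilizing effect the abstract attributes to sparsity.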

Agency: National Science Foundation (NSF)
Institute: Division of Social and Economic Sciences (SES)
Type: Standard Grant (Standard)
Application #: 0835550
Program Officer: Cheryl L. Eavey
Budget Start: 2008-09-15
Budget End: 2013-02-28
Fiscal Year: 2008
Total Cost: $417,179

Name: Princeton University
City: Princeton
State: NJ
Country: United States
Zip Code: 08540