A large amount of data is now easily accessible in real-time in a streaming fashion: news, traffic, temperature or other physical measurements sent by sensors on cell phones. Applying statistical and machine learning methods to these streaming data sets represents tremendous opportunities for a better real-time understanding of complex physical, social or economic phenomena. These algorithms could be used, for example, to understand trends in how news media cover certain topics, and how these trends evolve over time, or to track incidents in transportation networks.

Unfortunately, most algorithms for large-scale data analysis are not designed for streaming data; typically, adding data points (representing, say, today's batch of news articles from the Associated Press) requires re-solving the entire problem. In addition, many of these algorithms require the whole data set under consideration to be stored in one place. These constraints make classical methods impractical for modern, live data sets.

This project's focus is on optimization algorithms designed to work in online mode, allowing for faster, possibly real-time, updating of solutions when new data or constraints are added to the problem. Efficient online algorithms are currently known for just a few special cases. Using homotopy methods and related ideas, this work will seek to allow online updating for a host of modern data analysis problems. A special emphasis will be put on problems involving sparsity or grouping constraints; such constraints are important, for example, for identifying the few key features in a data set that explain most of the changes in the data. These new online algorithms will be amenable to distributed implementations, allowing parts of the data to be stored on different servers.
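As a hedged illustration of the online-updating idea, consider the simplest case, plain least squares (the project's homotopy methods target the harder sparse and constrained cases): when a new data point arrives, the solution can be refreshed with a rank-one Sherman-Morrison update instead of re-solving from scratch. All names and data below are illustrative, not part of this project's software.

```python
import numpy as np

class OnlineLeastSquares:
    """Least-squares fit refreshed one data point at a time.

    Keeps a running inverse of (ridge*I + X^T X) so each new point
    costs O(d^2) instead of re-solving the full problem.
    """

    def __init__(self, d, ridge=1e-6):
        self.P = np.eye(d) / ridge   # inverse of the regularized Gram matrix
        self.b = np.zeros(d)         # running X^T y

    def update(self, x, y):
        # Sherman-Morrison: (A + x x^T)^{-1} from A^{-1} in O(d^2)
        Px = self.P @ x
        self.P -= np.outer(Px, Px) / (1.0 + x @ Px)
        self.b += y * x
        return self.P @ self.b       # current solution after this point

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
model = OnlineLeastSquares(d=3)
for _ in range(500):
    x = rng.standard_normal(3)
    w_hat = model.update(x, x @ w_true)   # stream in noiseless points
print(w_hat)  # approaches w_true as data streams in
```

The same "update rather than re-solve" principle is what homotopy methods provide for sparse problems, where no closed-form rank-one update exists.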

These methods will be applied to streaming news data from major US media outlets, and to the problem of online detection, which arises when tracking an important signal over, say, a communication network as measurements arrive.

Project Report

This project is related to grant DMI-0969923, which is focused on the solution of large optimization problems that arise in machine learning and, more generally, "big data" applications. In this part of the grant, we have demonstrated the use of the technology developed in grant DMI-0969923 for a real-time text analytics platform called StatNews. This platform allows social scientists to analyze how a certain topic (e.g., "cervical cancer") is treated in a given corpus (news articles, scientific literature, etc.). It works by discovering which keywords are statistically strong "triggers" of the appearance of that topic in any unit of the texts. We successfully implemented the results of DMI-0969923 into the platform, resulting in a real-time display of results.

At the heart of the technology is a so-called "sparse learning" model based on ordinary linear regression, with an additional term that encourages sparsity. In ordinary regression used to predict, say, the weather, a price, or the appearance of a given keyword, one typically does not care which features are used in the prediction (is it yesterday's temperature? last week's pressure? etc.). In contrast, sparse learning seeks not only to make a good prediction, but also to pinpoint the *few* features that matter in a good predictor. This has applications in text analytics, since the model allows one to discover the *few* keywords that trigger the appearance of a given topic in documents.
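A minimal sketch of this sparse learning idea, assuming the standard l1-regularized ("lasso") formulation solved by proximal gradient descent (ISTA). The synthetic data, keyword indices, and function names below are illustrative assumptions, not the StatNews implementation: feature columns stand in for keyword counts and the response stands in for the appearance of a topic.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||X w - y||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2  # 1 / Lipschitz const of gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)                 # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))        # 200 documents, 50 candidate keywords
w_true = np.zeros(50)
w_true[[3, 17, 42]] = [2.0, -1.5, 1.0]    # only 3 keywords actually matter
y = X @ w_true + 0.01 * rng.standard_normal(200)

w_hat = lasso_ista(X, y, lam=1.0)
support = np.flatnonzero(np.abs(w_hat) > 0.1)
print(support)  # the few recovered "trigger" keyword indices
```

The l1 term drives most coefficients exactly to zero, so the surviving indices are the few "trigger" keywords; plain least squares would instead spread small nonzero weight over all 50 features.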

Project Start:
Project End:
Budget Start: 2011-08-31
Budget End: 2013-05-31
Support Year:
Fiscal Year: 2012
Total Cost: $150,473
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94710