A large amount of data is now easily accessible in real time in a streaming fashion: news, traffic, temperature, and other physical measurements sent by sensors on cell phones. Applying statistical and machine learning methods to these streaming data sets presents tremendous opportunities for a better real-time understanding of complex physical, social, or economic phenomena. These algorithms could be used, for example, to understand trends in how news media cover certain topics and how those trends evolve over time, or to track incidents in transportation networks.

Unfortunately, most algorithms for large-scale data analysis are not designed for streaming data; typically, adding data points (representing, say, today's batch of news articles from the Associated Press) requires re-solving the entire problem. In addition, many of these algorithms require the whole data set under consideration to be stored in one place. These constraints make classical methods impractical for modern, live data sets.

This project's focus is on optimization algorithms designed to work in online mode, allowing for faster, possibly real-time, updating of solutions when new data or constraints are added to the problem. Efficient online algorithms are currently known for only a few special cases. Using homotopy methods and related ideas, this work will seek to enable online updating for a host of modern data analysis problems. A special emphasis will be put on problems involving sparsity or grouping constraints; such constraints are important, for example, for identifying the few key features in a data set that explain most of the changes in the data. These new online algorithms will be amenable to distributed implementations, allowing parts of the data to be stored on different servers.
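To make the idea of online updating concrete, the sketch below warm-starts a LASSO solver from the previous solution when a new batch of data arrives, instead of re-solving from scratch. It uses plain coordinate descent rather than the homotopy methods the project proposes, and all function names and the synthetic data are hypothetical illustrations, not the project's actual code.

```python
import numpy as np

def lasso_cd(X, y, lam, beta0=None, n_iter=100):
    """Coordinate descent for min_beta 0.5*||y - X beta||^2 + lam*||beta||_1.
    Warm-started from beta0 when provided (a stand-in for homotopy updating)."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    col_sq = (X ** 2).sum(axis=0)     # per-feature curvature ||x_j||^2
    r = y - X @ beta                  # current residual
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ r + col_sq[j] * beta[j]                      # partial correlation
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]    # soft-threshold
            r += X[:, j] * (beta[j] - new)                               # incremental residual update
            beta[j] = new
    return beta

# Initial batch: solve from scratch.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 50)), rng.standard_normal(200)
beta = lasso_cd(X, y, lam=5.0)

# Streaming update: a new batch arrives; re-solve from the previous
# solution, so only a few passes are needed instead of a full solve.
X_new, y_new = rng.standard_normal((10, 50)), rng.standard_normal(10)
X_aug, y_aug = np.vstack([X, X_new]), np.concatenate([y, y_new])
beta = lasso_cd(X_aug, y_aug, lam=5.0, beta0=beta, n_iter=10)
```

Because the old solution is typically close to the new one when only a few data points are added, the warm-started solve converges in far fewer iterations; homotopy methods push this further by following the solution path exactly.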

These methods will be applied to streaming news data coming from major US media, and to the problem of online detection, which arises when tracking an important signal over, say, a communication network.

Project Report

This project focused on the solution of large optimization problems that arise in machine learning and, more generally, "big data" applications. Such problems appear in a variety of settings: deciding, for example, which few features determine health outcomes for patients in a hospital emergency unit, or the likelihood of bankruptcy of a business, or which keywords make an email spam or not spam. In this age of big data, such problems become very large. This project introduced novel methods that allow one to drastically reduce the size of the problem with a computationally fast but rigorous test. We also examined the case where the data arrives as a stream, and demonstrated how to re-use previous computations; again, the goal was to speed up and scale up.

In our driving application, text analytics, we face very large collections of text documents, millions of scientific articles for example, and we would like to analyze how a certain topic (e.g., "cervical cancer") is treated, by discovering which keywords "trigger" the appearance of that topic in any unit of the text. We have developed a web-based text analytics platform, StatNews, that displays the results of our analyses in real time, in a visually intuitive way. The technology developed in this project is what made it possible for us to solve very large problems in real time.
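The report does not spell out the screening test; one well-known rule of this kind is the basic SAFE test for the LASSO, which discards a feature whenever a cheap inequality certifies that its optimal coefficient is zero. The sketch below illustrates that published rule in NumPy under our own assumptions; it is not necessarily the project's exact method, and the data and names are ours.

```python
import numpy as np

def safe_screen(X, y, lam):
    """Basic SAFE rule for the LASSO  min_beta 0.5*||X beta - y||^2 + lam*||beta||_1.
    Feature j is provably inactive (beta_j = 0 at the optimum) whenever
        |x_j' y| < lam - ||x_j|| * ||y|| * (lam_max - lam) / lam_max,
    where lam_max = max_j |x_j' y| is the smallest lam for which beta = 0."""
    corr = np.abs(X.T @ y)
    lam_max = corr.max()
    thresh = lam - np.linalg.norm(X, axis=0) * np.linalg.norm(y) * (lam_max - lam) / lam_max
    return corr >= thresh   # boolean mask of features that survive the test

rng = np.random.default_rng(1)
X, y = rng.standard_normal((100, 5000)), rng.standard_normal(100)
lam_max = np.abs(X.T @ y).max()
keep = safe_screen(X, y, lam=0.8 * lam_max)
print(f"kept {keep.sum()} of {X.shape[1]} features")
# The LASSO then only needs to be solved on X[:, keep], a much smaller problem.
```

The test is rigorous in the sense that discarded features are guaranteed to have zero coefficients at the optimum, so the reduced problem has exactly the same solution as the original one.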

Budget Start: 2010-06-01
Budget End: 2013-05-31
Fiscal Year: 2009
Total Cost: $250,000
Name: University of California Berkeley
City: Berkeley
State: CA
Country: United States
Zip Code: 94704