This project is investigating the following three research problems concerning knowledge extraction and data mining from real-time stream systems:

Novel Framework for Stream Mining Systems: The project is studying how to formalize the problem of large scale distributed knowledge extraction from high volume streams by defining appropriate local and end-to-end objective functions, along with resource and delay constraints that will guide the different optimization and adaptation algorithms.

Topology construction: The project is studying methods to organize classifiers into a connected topology mapped onto a distributed infrastructure. The research starts by studying linear chains of classifiers and then seeks to extend the work to multiple chains working in parallel, and to classifier trees.

Decentralized Solutions based on Interactive Multi-Agent Learning: For large scale stream mining systems, where the classifiers are distributed across multiple nodes, the project is developing a decentralized decision framework and distributed algorithms for joint topology construction and local classifier configuration. In such distributed scenarios, optimizing the end-to-end performance requires interactive, multi-agent solutions to be deployed at each site in order to determine the effect of each classifier's decisions on the other classifiers.

Project Report

Multimedia data is now produced by more and more sources and in more and more formats: documents, email, transactions, tweets, audio files, videos, sensor readouts. With the proliferation of the sources and formats of data, the extraction of information from these data is becoming ever more challenging. Magnifying this is that the information must frequently be extracted from many distinct data streams and in real time to be of value! In this project, we have developed knowledge extraction engines that addressed the above-mentioned challenges and contributed to the foundation of a new field: multimedia stream mining. We have developed novel methods required to perform multimedia stream mining, such as constructing, managing, and adapting applications for the extraction of information in real time. Such applications are built as topologies of classifiers deployed on a set of processing nodes, which may be heterogeneous and distributed (so that the nodes can operate largely independently but still function in a coordinated way). In the first phase of this project we developed efficient methods for constructing and ordering such classifier topologies to extract knowledge in real time (e.g. answering specific queries of interest) from dynamically changing data streams. These queries are posed as a set of conjunctive classification operations. Our methods are able to adaptively construct optimal classifier topologies that minimize the end to end misclassification of data, while also minimizing the end to end processing delay across distributed stream mining systems. Developing such methods was challenging since the optimal ordering of classifiers to extract desirable knowledge from a dynamic data stream is dependent on the changing data characteristics, system resource constraints, and the performance and complexity characteristics of each classifier. We first developed centralized algorithms for joint ordering and individual classifier operating point selection. We then proposed a decentralized approach for distributed stream mining systems and used reinforcement learning methods to design a dynamic routing based order selection strategy. We demonstrated that these strategies lead to rapid convergence to an optimal chain of classifiers, while requiring minimum coordination and message exchange among the distributed classifiers. In the second phase of the project, we went a step further and developed online distributed algorithms which can learn how to construct the optimal classifier chain in order to maximize the stream mining performance based on the dynamically-changing data characteristics, but which do not require the distributed local classifiers to exchange any information. We prove that the long-term time average performance loss of the proposed distributed algorithm without message exchanges among classifiers as compared to the centralized algorithm which has available all the information tends to zero. This means that, surprisingly, our distributed algorithms without message exchange among classifiers do not incur any performance loss. This result is very useful in many practical applications where the classifiers are not co-located and may operate on different streams. In the final phase of this project, we have validated our methods for knowledge extraction from real-time data streams by applying them in a few concrete applications: real-time traffic forecasting and accident prediction, video-based object or face recognition services on mobile devices, distributed surveillance applications. With the vast availability of traffic sensors from which traffic information can be derived, numerous traffic prediction techniques have been developed in recent years (e.g. Google traffic prediction), which in turn improve route navigation, traffic regulation, urban area planning etc. One key challenge in traffic prediction is how much to rely on prediction models that are constructed using historical data in real-time traffic situations, which may differ from that of the historical data and change over time. Using the real-time stream mining and knowledge extraction techniques developed in this NSF project, our forecasting engine learns from the current traffic situation (or context) in real-time and predicts the future traffic by matching the current situation to the most effective prediction model trained using historical data. As real-time traffic arrives, the traffic context space is adaptively partitioned in order to efficiently estimate the effectiveness of each base predictor in different situations. The proposed algorithm also works effectively in scenarios where the true labels (i.e. realized traffic) are missing or become available with delay. Our experiments with real-world traffic data from Los Angeles shows that the proposed approach significantly outperforms all existing solutions for traffic forecasting such as those used by the Google traffic prediction engine. Moreover, our engine can predict traffic further in advance and based on fewer sensors. The most surprising part is that the probability of accidents happening in the next minute at a specific location can also be predicted with a relatively high accuracy. In terms of broader impact, this project led to one PhD thesis and one MSc thesis as well as to several tutorials on stream mining from Big Data at premier IEEE conferences in the field, in which our finding were disseminated.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Type
Standard Grant (Standard)
Application #
1016081
Program Officer
M. Mimi McClure
Project Start
Project End
Budget Start
2010-09-15
Budget End
2014-08-31
Support Year
Fiscal Year
2010
Total Cost
$299,890
Indirect Cost
Name
University of California Los Angeles
Department
Type
DUNS #
City
Los Angeles
State
CA
Country
United States
Zip Code
90095