State-of-the-art data stream clustering algorithms developed by the data mining community do not use the temporal order of events, so all temporal information is lost in the resulting clustering. This is surprising, because temporal ordering of events is one of the salient features of data streams. In this project we develop a technique to efficiently incorporate temporal ordering into the clustering process and demonstrate its usefulness on large, high-throughput data streams. Temporal ordering is introduced into the data stream clustering process by dynamically constructing an evolving Markov chain whose states represent clusters. Our approach is based on the previously developed Extensible Markov Model (EMM). The results of this project will provide a framework on which important stream mining applications, such as anomaly detection and prediction of future events, can be easily implemented.
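The idea of a Markov chain whose states are clusters can be illustrated with a minimal Python sketch. This is not the project's EMM implementation: the class name `StreamMarkovClusterer`, the fixed distance `threshold`, and the running-mean centroid update are all assumptions made for illustration. Each arriving point is assigned to the nearest cluster (or opens a new one), and the transition from the previous point's cluster to the current one is counted.

```python
import math
from collections import defaultdict


class StreamMarkovClusterer:
    """Hypothetical sketch: threshold clustering plus transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold            # assumed max distance to join a cluster
        self.centers = []                     # one centroid per cluster (= chain state)
        self.sizes = []                       # points absorbed by each cluster
        self.transitions = defaultdict(int)   # (from_state, to_state) -> count
        self.current = None                   # state of the most recent point

    def _nearest(self, point):
        """Return (index, distance) of the closest centroid, or (None, inf)."""
        best, best_dist = None, math.inf
        for i, center in enumerate(self.centers):
            d = math.dist(point, center)
            if d < best_dist:
                best, best_dist = i, d
        return best, best_dist

    def update(self, point):
        """Process one stream point and return the state it was assigned to."""
        state, dist = self._nearest(point)
        if state is None or dist > self.threshold:
            # No cluster is close enough: open a new cluster / state.
            self.centers.append(list(point))
            self.sizes.append(1)
            state = len(self.centers) - 1
        else:
            # Move the centroid toward the point (incremental mean).
            self.sizes[state] += 1
            n = self.sizes[state]
            center = self.centers[state]
            for j, x in enumerate(point):
                center[j] += (x - center[j]) / n
        if self.current is not None:
            # Record the temporal order as a state transition.
            self.transitions[(self.current, state)] += 1
        self.current = state
        return state

    def transition_probability(self, i, j):
        """Estimate P(next = j | current = i) from the observed counts."""
        total = sum(c for (a, _), c in self.transitions.items() if a == i)
        return self.transitions.get((i, j), 0) / total if total else 0.0
```

Feeding points one at a time with `update` keeps memory proportional to the number of clusters and observed transitions rather than to the length of the stream, which is the property that makes this kind of model suitable for streaming data.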

By showing that state-of-the-art data stream clustering algorithms can incorporate temporal order information efficiently, this project will have a broad impact on many areas where temporal order is essential. As examples, NOAA hurricane data and NASA satellite data will be used throughout this project. Results, including open source software, will be distributed via the project Web site (http://lyle.smu.edu/ida/tracds).

Project Report

Data collection capabilities have improved dramatically over the last few decades, resulting in the constant creation of enormous amounts of data. Much of this data is created as a data stream that is collected continuously over time. Examples are sales data from POS systems, web click-stream data, social media data including tweets, GPS data from mobile devices, remote sensing data in earth sciences, biomedical sensor data in medical applications, data from experiments in high-energy physics and environmental engineering, and data resulting from large-scale simulations in many areas of engineering and the sciences. The order in which events happen, and thus the order in which data is collected, is often crucial. While methods for modeling order in small, non-streaming data are well developed, modeling the order structure of massive data streams is still in its infancy.

In this project we have developed the first modeling method based on scalable data stream clustering and dynamically updated Markov models. We applied the method to the notoriously difficult problem of predicting hurricane intensity (i.e., wind speed). Our new approach, called Prediction Intensity Interval model for Hurricanes (PIIH), dynamically models hurricane life cycle behavior and applies these models to predict hurricane intensities up to 5 days in advance. What is completely new in this approach is that it also provides ranges (low to high) for the predicted maximum wind speed, with an indication of how likely wind speeds in each range are to occur. This is significant because hurricanes cause great human and economic losses in the United States every year. Improving intensity prediction, and indicating how close the estimate is likely to be to the real value, will help to improve hurricane readiness and thus reduce the risk to property and human life.

In addition to improving hurricane readiness, PIIH's behavioral models can also be used to study hurricane behavior and to improve the understanding of the complex relationships among the aspects of a tropical storm that drive its intensity. The PIIH live prediction web site developed as part of this project is publicly available at http://ida.lyle.smu.edu/PIIH/ (see also image).
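The ranged intensity prediction described above can be illustrated with a small Python sketch. This is not the PIIH model itself: the transition matrix `P`, the representative wind speed per state in `speeds`, the `coverage` parameter, and all function names are hypothetical. The sketch only shows how a Markov chain over clusters could, in principle, yield a wind speed range together with the probability mass behind it.

```python
def step(dist, P):
    """Advance a state distribution one time step through transition matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def predict_distribution(start_state, P, steps):
    """Distribution over states after `steps` transitions from `start_state`."""
    dist = [0.0] * len(P)
    dist[start_state] = 1.0
    for _ in range(steps):
        dist = step(dist, P)
    return dist


def intensity_interval(dist, speeds, coverage=0.8):
    """Collect the most probable states until `coverage` is reached.

    Returns (lowest speed, highest speed, probability mass covered).
    """
    ranked = sorted(range(len(dist)), key=lambda s: dist[s], reverse=True)
    chosen, mass = [], 0.0
    for s in ranked:
        if mass >= coverage:
            break
        chosen.append(s)
        mass += dist[s]
    winds = [speeds[s] for s in chosen]
    return min(winds), max(winds), mass


# Hypothetical three-state chain with made-up wind speeds in knots.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
speeds = [40, 80, 120]

dist = predict_distribution(start_state=0, P=P, steps=2)
low, high, prob = intensity_interval(dist, speeds, coverage=0.8)
print(f"{low}-{high} kt with probability {prob:.3f}")
```

Reporting the covered probability alongside the range is what distinguishes this kind of output from a single point forecast: the reader sees both the estimate and how much confidence the model places in it.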

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0948893
Program Officer: Maria Zemankova
Project Start:
Project End:
Budget Start: 2009-09-01
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $212,000
Indirect Cost:

Name: Southern Methodist University
Department:
Type:
DUNS #:
City: Dallas
State: TX
Country: United States
Zip Code: 75205