The goal of this project is to develop a stream processing system that captures data uncertainty from data collection through query processing to final result generation. The project focuses on data that is naturally modeled as continuous random variables. For such data, it employs a principled approach grounded in probability and statistical theory to capture data uncertainty, and it integrates this approach into high-volume stream processing. The first contribution of the project is to capture the uncertainty of raw data streams from sensing devices. Since raw streams can be highly noisy and may not present data in a format suitable for query processing, the project employs probabilistic models of the underlying data generation process, together with machine learning techniques, to efficiently transform raw data into a desired representation with an uncertainty metric. The second contribution is to capture uncertainty as data propagates through various query operators. To efficiently quantify the result uncertainty of a query operator, the project explores techniques based on probability and statistical theory that reduce the statistics input streams need to carry and expedite the computation of result distributions. The project integrates research and education through curriculum development and enables broader participation of women in research through college outreach and CRA's distributed mentor program. It also includes a software release and real-world deployments in domains such as object tracking and monitoring and hazardous weather monitoring, resulting in significant scientific and social impacts. Results of this project are disseminated at the project web site (http://avid.cs.umass.edu/projects/uncertain-streams/).
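As a rough illustration of what it means to capture uncertainty as data propagates through a query operator, the sketch below pushes two uncertain attributes through a sum operator. It assumes, purely for illustration, that each attribute is an independent Gaussian; the names `Gaussian`, `add_independent`, and `sample_sum` are hypothetical and are not taken from the project's software. Under that assumption the result distribution has a closed form, so each stream only needs to carry a mean and a variance, whereas a sampling-based evaluation has to draw and aggregate many values to approximate the same result.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Gaussian:
    """Uncertain attribute summarized by two statistics (illustrative assumption)."""
    mean: float
    var: float


def add_independent(x: Gaussian, y: Gaussian) -> Gaussian:
    """Closed-form result of a sum operator over two independent Gaussians.

    Means add and variances add, so no samples need to travel with the tuple.
    """
    return Gaussian(x.mean + y.mean, x.var + y.var)


def sample_sum(x: Gaussian, y: Gaussian, n: int, seed: int = 0) -> Gaussian:
    """Monte Carlo estimate of the same result, shown for comparison only."""
    rng = random.Random(seed)
    draws = [
        rng.gauss(x.mean, math.sqrt(x.var)) + rng.gauss(y.mean, math.sqrt(y.var))
        for _ in range(n)
    ]
    mean = sum(draws) / n
    var = sum((d - mean) ** 2 for d in draws) / (n - 1)  # unbiased sample variance
    return Gaussian(mean, var)


if __name__ == "__main__":
    a = Gaussian(mean=1.0, var=4.0)
    b = Gaussian(mean=2.0, var=9.0)
    print("closed form:", add_independent(a, b))
    print("sampled    :", sample_sum(a, b, n=20000))
```

The closed-form path costs two additions per tuple, while the sampled path costs work proportional to the number of draws and only approximates the answer. Other operators and non-Gaussian inputs generally do not admit such a simple form.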
The goal of this project was to address uncertain data management, a problem that commonly arises in large-scale sensing and scientific computing applications such as object tracking using Radio Frequency Identification (RFID) and tornado detection using weather radar streams. Research in this project led to the development of an uncertain data management system, called CLARO, that captures data uncertainty from data collection through data processing to final result production. The research was divided into two related areas: (1) data capture and transformation, where raw, noisy data streams are transformed in real time into rich, queryable tuple streams that carry the necessary (sometimes new) attributes for query processing and characterize the uncertainty of these attributes using continuous probability distributions; and (2) complex query processing on uncertain tuple streams that carry continuous probability distributions, where the CLARO system efficiently captures the uncertainty of query results on these tuples and returns only the high-confidence results requested by the user.
Results of this project significantly advanced the state of the art in uncertain data management and led to the following contributions:
(1) The CLARO system is among the first to provide an integrated solution for supporting highly complex queries on noisy raw data streams and for formally characterizing data uncertainty from data collection through data processing to final result output.
(2) For data capture and transformation, the proposed system can transform raw data into tuples whose uncertain attributes are characterized by continuous probability distributions that are highly concentrated on the true values. The system further performs this transformation at stream speed, achieving orders of magnitude improvement in performance and scalability over existing data cleaning and inference techniques.
(3) To support complex query processing on tuple streams that carry continuous probability distributions, the proposed system can take arbitrary user-defined accuracy requirements and provide an efficient evaluation plan whose query results satisfy those requirements. The techniques employed in the system outperform the best sampling methods in both accuracy and speed on data from RFID object tracking and computational astrophysics. These techniques are also shown to allow a tornado detection system to reduce the number of output errors by two orders of magnitude while still processing high-volume data at stream speed.
The results of this project have broader scientific, social, and educational impacts. Evaluation results using real-world data sets, from domains including RFID object tracking, tornado detection, and computational astrophysics, provide direct evidence of the efficiency and effectiveness that the proposed techniques can offer to applications in those domains. The evaluation results obtained from a real tornado detection system indicate that the ability to distill useful data from noisy data can enable tornado detection in real time and with much improved efficiency, which may result in large social impact in future deployments. Besides research activities, this project also involved a number of educational efforts, including an integrated undergraduate and graduate curriculum on data management and statistical analysis, and outreach and mentoring activities to engage women in research.
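To make concrete the idea of returning only query results of high confidence, the following minimal sketch evaluates a selection predicate over tuples whose attribute is uncertain. It assumes a Gaussian attribute and a simple "value greater than a threshold" predicate; the function names and the tuple layout `(tag, mean, std)` are hypothetical and do not reflect CLARO's actual interfaces. Each tuple is kept only if the probability that it satisfies the predicate meets the user's confidence requirement.

```python
import math


def prob_greater_than(mean: float, std: float, threshold: float) -> float:
    """P(X > threshold) for X ~ Normal(mean, std^2), via the error function."""
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))


def select_confident(tuples, threshold: float, min_confidence: float):
    """Keep tuples whose probability of satisfying the predicate is high enough.

    `tuples` is an iterable of (tag, mean, std) triples (hypothetical layout).
    Returns a list of (tag, probability) pairs for the retained tuples.
    """
    results = []
    for tag, mean, std in tuples:
        p = prob_greater_than(mean, std, threshold)
        if p >= min_confidence:
            results.append((tag, p))
    return results


if __name__ == "__main__":
    stream = [("a", 10.0, 1.0), ("b", 8.5, 1.0), ("c", 5.0, 1.0)]
    for tag, p in select_confident(stream, threshold=8.0, min_confidence=0.95):
        print(tag, round(p, 4))
```

Here the confidence requirement is a single probability cutoff per tuple. The accuracy requirements described in the outcomes above are more general, so this is only the simplest instance of the idea.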