Due to the importance of uncertain data for a large number of applications, there has been significant recent interest in database support for uncertain data. Existing work in this area includes new models for uncertain data, prototype implementations, and efficient query processing algorithms for specific types of queries. Despite the recent efforts, several important aspects of uncertain data management remain unexplored. This project addresses two of these areas: Query Optimization and Support for Non-Relational Operators.

The first goal is about efficient execution of uncertain data queries. As with traditional data, efficient execution is necessary for ensuring the viability of uncertain data management systems. However, due to the complications of ensuring correct results, and the need for CPU-intensive operations over probability distributions, the goal is critical and challenging. In this project, automatic query optimizations are developed, through query rewriting rules that involve probability threshold operators, corresponding access methods, and cost estimation functions.

The difficulty of handling uncertainty when dealing with non-relational operators has been expressed in many domains. The project aims to advance the capability of tracking the exact impact of uncertain inputs as data is processed by arbitrary programs, leveraging advanced techniques from the area of program analysis. A key problem with traditional Monte Carlo based solutions lies in correctly identifying independence in the output of Monte Carlo simulations. Data lineage tracing, which identifies the set of inputs used to compute an output value, is used to address the challenge. Furthermore, a program dependence tracing based approach is devised to trace the propagation of uncertainty during execution of arbitrary binary code. The technique does not rely on Monte Carlo simulations, and does not require access to source code or domain knowledge.

For further information see the project web page: www.cs.purdue.edu/homes/sunil/uncertainty

Project Report

Uncertainty in data processing is ubiquitous, arising from multiple sources including application semantics, measurement errors, calibration errors, limited sampling, missing data, modeling errors, and limited representation precision in floating point programs. Properly handling such uncertainty is an important requirement, especially in the era of "Big Data", in which a lot of (critical) decisions are made based on (uncertain) data processing. In this project, the PIs aim to develop computational solutions focusing on providing uncertain data management capabilities within and outside database engines. For uncertain data processing outside databases, the PIs have developed the following solutions: (1) a tracing technique for data processing programs to exactly trace the propagation of input uncertainty. (2) runtime program analysis techniques to substantially reduce the cost of performing Monte Carlo analysis on uncertain domains. (3) an on-the-fly monitoring technique that can predict instability caused by uncertainty in floating point representation. For uncertain data processing inside databases, the PIs have developed new query optimization techniques for general Select-Project-Join queries over uncertain data. Earlier work on query optimization over uncertain data has been limited to specific query types (e.g., range, nearest-neighbor, skyline, etc.) or limited uncertain data models. The PIs have introduced a new probability threshold relational operator, and identified a number of query equivalences involving the new operator and the standard select, project, join operators. The new operator have been implemented as part of the Orion system. The rewritten queries are shown to improve the performance of the queries by as much as 90%.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0916874
Program Officer
Frank Olken
Project Start
Project End
Budget Start
2009-09-01
Budget End
2013-08-31
Support Year
Fiscal Year
2009
Total Cost
$474,912
Indirect Cost
Name
Purdue University
Department
Type
DUNS #
City
West Lafayette
State
IN
Country
United States
Zip Code
47907