Increasing numbers of real-world application domains are generating data that is inherently noisy, incomplete, and probabilistic in nature. Examples of such data include measurement data collected by sensor networks, observation data in the context of social networks, scientific and biomedical data, and data collected by various online cyber-sources. The data uncertainties may be a result of the fundamental limitations of the underlying measurement infrastructures, the inherent ambiguity in the domain, or they may be a side-effect of the rich probabilistic modeling typically performed to extract high-level events from sensor and cyber data. Similarly, when attempting to integrate heterogeneous data sources ("data integration") or extracting structured information from text ("information extraction"), the results are approximate and uncertain at best. However, there is currently a lack of data management tools that can reason about large volumes of uncertain data, and hence the information about the uncertainty is often either discarded or reasoned about only superficially.

In this project, we are building a complete probabilistic data management system, called PrDB, that can manage, store, and process large-scale repositories of uncertain data. PrDB unifies ideas from "large-scale structured graphical models" like probabilistic relational models (PRMs), developed in the machine learning literature, and "probabilistic query processing", studied in the database literature. The PrDB framework is based on the notion of "shared factors", which not only allows us to express and manipulate uncertainties at various levels of abstraction, but also supports capturing rich correlations among the uncertain data. PrDB supports a declarative SQL-like language for specifying uncertain data and the correlations among them. PrDB also supports exact and approximate evaluation of a wide range of queries, including inference queries, SQL queries, and decision-support queries.
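To make the idea of shared factors more concrete, here is a minimal sketch. It is not PrDB's implementation or its query language. It assumes three hypothetical Boolean "tuple exists" variables (s1, s2, s3), made-up prior probabilities, and one hypothetical pairwise factor, agree, that is reused for two pairs of tuples. The probability of a query is computed by brute-force enumeration of possible worlds, which is a naive form of inference over the factor model and is feasible only for tiny examples.

```python
from itertools import product

# All names and numbers below are illustrative assumptions, not PrDB's API.

def prior(p):
    """Unary factor: weight p if the tuple exists, 1 - p otherwise."""
    return lambda exists: p if exists else 1.0 - p

def agree(x, y):
    """Shared pairwise factor: gives more weight to worlds where x == y."""
    return 2.0 if x == y else 1.0

variables = ["s1", "s2", "s3"]
unary = {"s1": prior(0.9), "s2": prior(0.5), "s3": prior(0.2)}
# Both pairs reuse the same factor function, so the correlation is
# specified once and shared rather than stored per pair.
pairs = [("s1", "s2"), ("s2", "s3")]

def weight(world):
    """Unnormalized weight of one possible world (product of all factors)."""
    w = 1.0
    for v in variables:
        w *= unary[v](world[v])
    for a, b in pairs:
        w *= agree(world[a], world[b])
    return w

def query_probability(predicate):
    """Probability that `predicate` holds, by enumerating every world."""
    total = 0.0
    satisfied = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        w = weight(world)
        total += w
        if predicate(world):
            satisfied += w
    return satisfied / total

# Example query: does at least one of s2, s3 exist?
p_any = query_probability(lambda w: w["s2"] or w["s3"])
print(f"P(s2 or s3) = {p_any:.4f}")
```

Enumeration takes time exponential in the number of variables, which is why exact and approximate inference techniques matter at scale.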

The cross-disciplinary research undertaken during this project will enable us to simultaneously address challenges in the areas of probabilistic databases and machine learning, and to transfer the key technologies developed between those areas, thus advancing the research in both. It will enable the development of a significant and high-impact new class of real-world applications in a variety of domains, including health informatics, social media management, the World Wide Web, and scientific databases. The PrDB system source code, and the datasets generated during the project, will be released under an appropriate open source license at the project web site: www.cs.umd.edu/db/PrDB.html

Project Report

This NSF-funded project was motivated by the observation that increasing numbers of real-world application domains are generating data that is inherently noisy, incomplete, and probabilistic in nature. Examples of such data include measurement data collected by sensor networks, observation data in the context of social networks and scientific databases, and data collected by various online cyber-sources. The data uncertainties may be a result of the fundamental limitations of the underlying measurement infrastructures, the inherent ambiguity in the domain, or they may be a side-effect of the rich probabilistic modeling typically performed to extract high-level events from sensor and Web data. Similarly, when attempting to integrate heterogeneous data sources ("data integration") or extracting structured information from text ("information extraction"), the results are approximate and uncertain at best.

It is relatively straightforward to capture such uncertainties by adding appropriate annotations to the data, with "probabilities" being the most natural and intuitive form of such annotations. However, traditional data management tools cannot properly reason about or operate upon such annotated data. Further, executing even simple analysis tasks on such data turns out to be intractable in many cases. The goal of this project was to develop a "probabilistic" data management system to manage, store, and query large-scale repositories of data annotated with probabilities. While there has been prior work on developing such probabilistic data management systems, most of that prior work considered simplistic correlation models that are insufficient to represent the uncertainties in many of the application domains. The key outcomes of this project can be summarized as follows.
We built a unifying framework, and a prototype system called "PrDB", for combining the rich modeling power of structured graphical models, developed in the machine learning literature, and the large-scale data processing capabilities of database systems. The PrDB framework is based on the notion of "shared factors", which not only allows one to express and manipulate uncertainties at various levels of abstraction, but also supports capturing rich correlations among the uncertain data. PrDB supports a high-level declarative SQL-like language for specifying uncertain data and the correlations among them. PrDB also supports exact and approximate evaluation of a wide range of queries, including inference queries, SQL queries, and decision-support queries.

To build PrDB and to make it efficient at handling large volumes of uncertain data, we developed a suite of novel algorithms and data structures. We showed how query evaluation in probabilistic databases is equivalent to "inference" in probabilistic graphical models, and developed several novel inference and query processing techniques. We designed and implemented indexing structures for querying large-scale correlated datasets that result in orders-of-magnitude performance improvements.

We developed a learning-based approach to "ranking" over probabilistic data that combines many previous approaches in a single, elegant framework. Our algorithms work for both discrete and continuous uncertainties, can scale to very large datasets, and can handle correlations in the data.

We developed the notion of "consensus answers", which promises to significantly increase the utility of probabilistic databases by allowing us to systematically convert probabilistic query results into deterministic query results; the latter are better suited for further analysis and decision making by users and applications.
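As an illustration of turning a probabilistic result into a deterministic one, the sketch below works through one simple setting. It assumes that a query answer is a set of tuples, that the distance between two answers is the size of their symmetric difference, and that only per-tuple marginal probabilities are known. These are assumptions made for this example; the project's own definitions and algorithms may differ. The tuple names and probabilities are hypothetical. Under these assumptions, linearity of expectation means the deterministic answer with the smallest expected distance keeps exactly the tuples whose probability exceeds 0.5, and the code checks this against exhaustive search.

```python
from itertools import combinations

def consensus_answer(marginals):
    """Keep each tuple whose marginal probability exceeds 0.5."""
    return {t for t, p in marginals.items() if p > 0.5}

def expected_sym_diff(candidate, marginals):
    """Expected symmetric-difference distance to the random true answer.

    A tuple in the candidate costs 1 - p (it may be absent from the
    true answer); a tuple left out costs p (it may be present).
    Linearity of expectation makes this valid even if the tuples
    are correlated.
    """
    return sum((1.0 - p) if t in candidate else p
               for t, p in marginals.items())

def best_by_enumeration(marginals):
    """Exhaustive check: smallest expected distance over all subsets."""
    tuples = list(marginals)
    best = None
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            d = expected_sym_diff(set(subset), marginals)
            if best is None or d < best:
                best = d
    return best

# Hypothetical marginal probabilities of three result tuples.
marginals = {"t1": 0.9, "t2": 0.4, "t3": 0.6}
answer = consensus_answer(marginals)
print(sorted(answer), expected_sym_diff(answer, marginals))
```

Other distance measures, or answers that are rankings or aggregates instead of sets, would need different procedures.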
We proposed and developed the notions of "influence" and "explanation" to better understand query results, and to facilitate robust query processing over uncertain databases.

We proposed a general uncertain graph model that captures a variety of different uncertainties, including "identity uncertainty", which was not captured by prior models. We also showed how to efficiently perform subgraph pattern matching queries over such uncertain graphs.

Overall, this project has shown how large volumes of uncertain data can be efficiently stored in a relational database, and how a variety of queries can be efficiently executed over it. This project thus paves the way for incorporating uncertain data management in commercial database systems. More details about the project, and the publications that resulted from it, can be found at: www.cs.umd.edu/~amol/PrDB.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0916736
Program Officer: Frank Olken
Project Start:
Project End:
Budget Start: 2009-09-01
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $498,538
Indirect Cost:
Name: University of Maryland College Park
Department:
Type:
DUNS #:
City: College Park
State: MD
Country: United States
Zip Code: 20742