Databases and data management tools have a deterministic semantics: a data item either is in the data set or is not. But when data comes from multiple sources, or is extracted automatically, it often contains a variety of imprecisions that are difficult to model using the standard deterministic semantics: the same data item may have different representations in different sources and matching algorithms are imprecise; schemas differ across sources and schema matching tools are imprecise or incomplete or both; data at different sources may hold contradictory information; finally, some data, such as sensor data, is inherently probabilistic and hence imprecise. This project represents all sources of imprecision in a single uniform way, as data with a probabilistic semantics, and extends today''s data management tools to manage efficiently data with probabilistic semantics.
Re-designing databases to handle probabilistic data is a daunting task. This project studies two problems that lie at the core of probabilistic data management: the complexity of the query evaluation problem on probabilistic databases, and the view materialization problem (deciding whether a view can be materialized and whether it can be used in other query plans). The results of this research will consists of a range of fundamental techniques to be used in a general purpose probabilistic query processor.
Intellectual Merits. The project makes new contributions that lie at the intersection of several disparate fields: logic, probability theory, knowledge representation, and traditional query processing and optimization. The project enhances the understanding of the query evaluation problem on probabilistic data, develops new algorithms for efficiently evaluating such queries and for materializing views, while leveraging existing database technology.
Broader Impact. Searching large information spaces (the Web; large collections of scientific databases; Homeland Security data) is one the new and most challenging frontiers in Computer Science. The innovation that is needed to support complex searches that scale to large and heterogeneous information spaces, has to come from the data management research community. This project makes contributions to broaden our ability to search large information spaces. If successful, the project will be one of the pieces that will help data management technology undergo a new paradigm shift, from supporting complex queries with deterministic semantics, to supporting complex explorations with probabilistic semantics.
Project URL: www.cs.washington.edu/homes/suciu/project-probD