Probabilistic databases have many important applications in areas such as crime fighting, data cleaning and integration, moving object tracking, and science, and have recently moved into the limelight of data management research. Probabilistic databases are expected to allow for new data management applications to arise that currently require substantial bootstrapping and development effort, if they are feasible at all. Probabilistic databases allow for very flexible ways of representing and managing incompleteness and are a possible foundation for a new breed of more encompassing and powerful data integration systems. Probabilistic databases could give rise to more effective, lazy integration systems, in which the input is initially inserted into the database with a high degree of uncertainty, which is subsequently reduced using various forms of corpus data, evidence, and probabilistic mappings and rules.
The primary goal of this project is to perform foundational research and develop a probabilistic database management system that can support such applications. The first aim of this project is to work on efficient query processing techniques, with a particular focus on the essential problems of computing tuple confidence values and the closely related problem of conditioning a probabilistic database. Both are known to be computationally hard, and efficient techniques are expected to have to make use of the state of the art in algorithms for constraint satisfaction. In some applications, such as ad-hoc query processing for decision support, exact query results are not necessary. Previous work has pointed out efficient approximation techniques for simple queries, namely for the ranked retrieval of results of conjunctive queries together with their confidences; however, there is currently no work on approximating queries that themselves make use of confidence values, for example in selection operations.
Approximating expressive, compositional queries on probabilistic databases is a second main focus of the planned project. Thirdly, nearly all previous work on probabilistic databases has focussed on designing representation systems and query languages and laying the foundations for efficient query processing. For real data management applications to use probabilistic database systems in the future, it must also be possible to update such databases, and to run transactional programs on them. While no work on this exists in the literature, uncertainty in the data poses a number of fundamental challenges to be addressed as the third main aim of the project.
The homepage of the MayBMS project which is the subject of this proposal can be found at www.cs.cornell.edu/database/maybms/ .