Large Knowledge Bases are constructed today automatically from large corpora of text, like Web pages, journal articles, news stories. The construction proceeds in two major stages. First, several database queries are computed on the corpora of text, to extract candidate data items; the resulting data, called a factor graph, can be thought of as a very large, noisy, uncertain, redundant, and inconsistent database. Second, a complex probabilistic inference is performed on the factor graph to produce a large, probabilistic knowledge base. Both stages are computationally expensive, but only the first stage has benefited so far from advances in database query processing techniques. This project develops new database processing techniques for the probabilistic inference task. These new techniques have theoretical guarantees, either in the form of absolute guarantees on the runtime of the probabilistic inference, or in the form of a trade-off between the run time and the precision of the probabilistic inference.

The main technique pursued by the project is called lifted probabilistic inference, and consists of algorithms that compute the probability of a SQL query inductively on the structure of the query, without having to first ground the query to compute the large factor graph. Lifted inference is very efficient, but possible only for some queries. The project has four thrusts. First, it combines sampling with lifted inference for efficient approximate probabilistic inference for any query; this algorithms can pushed in the database engine, and can therefore benefit immediately from all optimizations available today in modern, parallel query processors. Second, the project studies the complexity of query evaluation on symmetric databases, a special case of high practical importance, since it scales easily to arbitrarily large domains. In the third thrust, the project extends lifted inference techniques to queries with negations by combining probabilistic inference with resolution; this is necessary because soft constraints in knowledge bases almost always have negations. Finally, the project develops a system prototype and benchmarks.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1614738
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2016-07-15
Budget End
2020-06-30
Support Year
Fiscal Year
2016
Total Cost
$516,000
Indirect Cost
Name
University of Washington
Department
Type
DUNS #
City
Seattle
State
WA
Country
United States
Zip Code
98195