Large Knowledge Bases are constructed today automatically from large corpora of text, like Web pages, journal articles, news stories. The construction proceeds in two major stages. First, several database queries are computed on the corpora of text, to extract candidate data items; the resulting data, called a factor graph, can be thought of as a very large, noisy, uncertain, redundant, and inconsistent database. Second, a complex probabilistic inference is performed on the factor graph to produce a large, probabilistic knowledge base. Both stages are computationally expensive, but only the first stage has benefited so far from advances in database query processing techniques. This project develops new database processing techniques for the probabilistic inference task. These new techniques have theoretical guarantees, either in the form of absolute guarantees on the runtime of the probabilistic inference, or in the form of a trade-off between the run time and the precision of the probabilistic inference.
The main technique pursued by the project is called lifted probabilistic inference, and consists of algorithms that compute the probability of a SQL query inductively on the structure of the query, without having to first ground the query to compute the large factor graph. Lifted inference is very efficient, but possible only for some queries. The project has four thrusts. First, it combines sampling with lifted inference for efficient approximate probabilistic inference for any query; this algorithms can pushed in the database engine, and can therefore benefit immediately from all optimizations available today in modern, parallel query processors. Second, the project studies the complexity of query evaluation on symmetric databases, a special case of high practical importance, since it scales easily to arbitrarily large domains. In the third thrust, the project extends lifted inference techniques to queries with negations by combining probabilistic inference with resolution; this is necessary because soft constraints in knowledge bases almost always have negations. Finally, the project develops a system prototype and benchmarks.