This project studies methods for extracting accurate knowledge bases from the Web. Fully automated Web information extraction techniques are massively scalable, but they suffer from limited accuracy and coverage. This proposal investigates how to improve automated extraction techniques by introducing carefully selected human guidance. The proposed system continually extracts knowledge from the Web, dynamically synthesizing and issuing queries to humans along the way to increase the accuracy of both the system's knowledge base and its extractors.

The approach extends the PI's previous work on statistical language models (SLMs) for information extraction. Novel SLMs are investigated that unify the extraction of relational data expressed in Web tables with extraction from free text. New active learning techniques use the models to identify "high-leverage" queries -- requesting, for example, textual extraction patterns that, when applied to the Web, yield thousands of novel extractions. The queries investigated are mostly amenable to non-experts, meaning that much of the human input can be acquired at scale via online mass collaboration. A minimal sketch of how such queries might be prioritized appears below.
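To make the notion of a "high-leverage" query concrete, the following sketch illustrates one common active-learning heuristic; it is not the project's implementation. It scores each candidate extraction pattern by its estimated yield of novel extractions, weighted by the model's current uncertainty about the pattern, and routes the top-scoring patterns to human labelers. The Pattern fields and the example patterns are hypothetical placeholders invented for illustration.

    from dataclasses import dataclass

    @dataclass
    class Pattern:
        # A candidate textual extraction pattern, e.g. "cities such as X".
        text: str
        est_matches: int   # estimated number of Web occurrences (assumed available)
        p_correct: float   # model's current probability that the pattern is reliable

    def leverage(p: Pattern) -> float:
        # Uncertainty peaks at p_correct = 0.5 and vanishes near 0 or 1,
        # so patterns the model is already sure about score low.
        uncertainty = 1.0 - abs(2.0 * p.p_correct - 1.0)
        return p.est_matches * uncertainty

    def select_queries(patterns, budget):
        # Route the `budget` highest-leverage patterns to human labelers.
        return sorted(patterns, key=leverage, reverse=True)[:budget]

    candidates = [
        Pattern("cities such as X", est_matches=12000, p_correct=0.55),
        Pattern("X is the capital of", est_matches=3000, p_correct=0.95),
        Pattern("X, a city in", est_matches=8000, p_correct=0.50),
    ]
    for p in select_queries(candidates, budget=2):
        print(f"ask about: {p.text!r} (leverage = {leverage(p):.0f})")

Under this heuristic, a pattern the model is already confident about (whether correct or incorrect) earns a low score regardless of its yield, since human input on it would change little.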

The broader impact of this project lies in the potential for accurate Web extraction to radically improve Web search, allowing users to answer complicated questions by synthesizing information across multiple Web pages. In domains like medicine and biology, mining extracted knowledge bases could lead to important discoveries and novel therapies.

Further information may be found at the project web page: http://wail.eecs.northwestern.edu/projects/activelms/index.html

Project Report

This project studied methods for automatically extracting knowledge bases from the World Wide Web. The goal behind our work is to transform the Web's vast human-readable content into machine-understandable knowledge. This capability would enable transformative technologies, such as new search engines that answer complex questions by synthesizing information scattered across the Web. We focused on three primary research questions: How can we integrate knowledge extracted from both Web text and Web tables? How can statistical language models trained over large text corpora help improve extraction accuracy? And how can an extraction system actively solicit well-selected human input to improve the extraction process?

The project led to the invention of new knowledge extraction techniques, primarily aimed at Wikipedia's text and data tables. A fundamental knowledge extraction challenge is automatically identifying relationships between concepts. We developed state-of-the-art methods for estimating the degree of semantic relatedness (SR) between two Wikipedia concepts, along with new methods for explaining those relationships to Web users in natural language. These methods leveraged machine learning techniques to mine Wikipedia's text, hyperlinks, and categories for semantic information. We also developed new techniques for extracting data from Wikipedia tables and for automatically joining together different tables that contain related information.

In addition, we developed new methods for scaling up statistical language models (SLMs) for information extraction. "Latent-variable" SLMs have been shown to improve extraction systems, but the memory required to train the models forms a bottleneck. We developed a new method for overcoming the memory bottleneck, based on intelligently partitioning the corpus across a parallel computing cluster. Our experiments showed that the partitioning method decreases the memory footprint of model training by half for large data sets.

The broader impacts of our work included student training, public prototype applications, and the release of data sets and code to the research community. Multiple PhD, MS, and undergraduate students participated in our research and co-authored publications. We delivered a public prototype demonstrating our table extraction research, called "WikiTables." An additional public prototype, the "Atlasify" system, which uses our semantic relatedness research to create interactive visualizations that place query concepts (e.g., "nuclear power") on familiar reference systems (e.g., the world map or the periodic table), is under development. We disseminated our work to the research community in multiple papers at major conferences and workshops, and we released other resources, including a codebase for our SLM training technique, new datasets for SR and table extraction, and a scalable public API for computing SR. The papers, prototypes, and other research products are publicly available.

For further information, please consult the project Web site: http://websail.cs.northwestern.edu/activelms/
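The project's publications are the authoritative description of our SR methods. As a rough illustration of how link-based relatedness over Wikipedia can be computed, the sketch below implements the widely used inlink-overlap measure of Witten and Milne (2008), which is representative of the hyperlink signals such systems combine with text- and category-based features; it is not necessarily the exact measure used in this project, and the toy link sets and article count are hypothetical.

    import math

    def link_relatedness(inlinks_a, inlinks_b, n_articles):
        # Witten & Milne (2008): relatedness from overlap between the sets
        # of Wikipedia articles that link to concepts a and b.
        common = inlinks_a & inlinks_b
        if not common:
            return 0.0
        big = max(len(inlinks_a), len(inlinks_b))
        small = min(len(inlinks_a), len(inlinks_b))
        distance = ((math.log(big) - math.log(len(common)))
                    / (math.log(n_articles) - math.log(small)))
        return max(0.0, 1.0 - distance)

    # Hypothetical toy inlink sets for two concepts.
    a = {"Energy", "Physics", "Chernobyl", "Reactor"}
    b = {"Chemistry", "Mining", "Chernobyl", "Reactor"}
    print(round(link_relatedness(a, b, n_articles=4000000), 3))  # ~0.95 on this toy data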
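Likewise, the sketch below conveys only the general shape of corpus partitioning for parallel SLM training: each worker holds and trains on just its own shard, so per-node memory scales down with the number of workers. The project's method chooses partitions intelligently rather than by hash as done here, and all identifiers below are hypothetical.

    import hashlib

    def shard_of(doc_id, n_workers):
        # Deterministically map a document to one of n_workers partitions.
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % n_workers

    def partition_corpus(doc_ids, n_workers):
        # Each worker loads and trains on only its own shard, so per-node
        # memory shrinks roughly with the number of workers.
        shards = [[] for _ in range(n_workers)]
        for doc_id in doc_ids:
            shards[shard_of(doc_id, n_workers)].append(doc_id)
        return shards

    corpus = [f"doc-{i}" for i in range(10)]
    for w, shard in enumerate(partition_corpus(corpus, n_workers=3)):
        print(f"worker {w}: {shard}")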

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1016754
Program Officer: Sylvia Spengler
Budget Start: 2010-08-15
Budget End: 2013-07-31
Fiscal Year: 2010
Total Cost: $183,736
Name: Northwestern University at Chicago
City: Chicago
State: IL
Country: United States
Zip Code: 60611