This proposal aims to enable scalable and adaptable information integration over large-scale unstructured data. Users need to run structured queries over unstructured data, for example by executing SQL queries over Web pages. This project will develop a new "query push-down" approach, distinct from conventional "data pull-up" techniques, as a promising direction for achieving agility in integration. The technical objectives will be driven by two application domains: Army land planning and the Illinois digital library. The team will develop query translation techniques that "push down" queries into a form that can be executed over unstructured document and feature indexes. This approach will eliminate the expensive, inflexible, and often fragile extraction of unstructured data, enabling scalable and adaptable information integration through "best effort" semantics. In the query push-down approach, queries are no longer executed under SQL-like Boolean semantics but instead take a maximum likelihood interpretation: given uncertainty and imprecision in the data, what are the most likely answers to a properly translated query? The team will study the formalism that governs such probabilistic query execution, so that "best effort" is achieved with probabilities as a formal quality metric. Researchers will build the Data-oriented Content Query System, which will let users of Web data specify not only keywords but also data types, and query for relevant values of their desired data in the contents of the corpus by specifying flexible patterns and customizing scoring functions. Structured queries will be translated for execution in the system to access and integrate the unstructured contents of the corpus.
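The contrast between Boolean execution and a "most likely answers" interpretation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus, the `PRICE` pattern standing in for a data type, and the scoring rule (fraction of query keywords a page contains, summed per candidate value and normalized). It is not the formalism the project proposes, only one simple way to rank answers with probabilities over a keyword index.

```python
import re
from collections import defaultdict

# Hypothetical toy corpus standing in for crawled Web pages.
DOCS = {
    "d1": "Hyatt Chicago hotel rooms from $189 per night near downtown",
    "d2": "Chicago hotel deals: Hyatt Chicago at $189, book now",
    "d3": "Hyatt Chicago parking costs $45 per day",
    "d4": "Review of a Boston hotel, paid $210",
}

# Assumed recognizer for a "price" data type.
PRICE = re.compile(r"\$\d+(?:\.\d{2})?")


def build_index(docs):
    """Build an inverted index: lowercase word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index


def answer(keywords, value_pattern, docs, index):
    """Rank candidate values by normalized evidence, not Boolean matching."""
    # Count how many query keywords each document contains (index lookups only).
    hits = defaultdict(int)
    for kw in keywords:
        for doc_id in index.get(kw, ()):
            hits[doc_id] += 1

    # Each document votes for the values it mentions, weighted by keyword coverage.
    scores = defaultdict(float)
    for doc_id, n in hits.items():
        weight = n / len(keywords)
        for value in set(value_pattern.findall(docs[doc_id])):
            scores[value] += weight

    total = sum(scores.values())
    if not total:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(value, score / total) for value, score in ranked]


INDEX = build_index(DOCS)
# Roughly: "SELECT price WHERE the page is about the Hyatt Chicago hotel".
ranked = answer(["hyatt", "chicago", "hotel"], PRICE, DOCS, INDEX)
print(ranked[0][0])  # most likely value: $189
```

A strict Boolean conjunction over the three keywords would return only the pages containing all of them. The sketch instead keeps partially matching pages as weaker evidence, so every candidate value receives a probability and the top-ranked one is reported as the best-effort answer.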

Successful results from the proposed research will have significant impact in two areas. First, the research community has observed the scalability limitations of current integration schemes, and these observations highlight the urgency of developing large-scale, agile integration techniques. The proposed study will formally advance the understanding of large-scale best-effort integration and develop a set of general techniques. Second, the development of the query system engine will provide access to the data-rich Web, with practical deployment at the Illinois Gateway of the UIUC digital library, which will improve students' and faculty's access to online scholarly and open information. Students will be directly involved in the research effort, and new curricula are planned.

Project Report

This project studied and developed new search methods for querying large-scale unstructured text data, such as the Web, with data-oriented semantics (queries that allow users to specify entities and relationships) in a semantically effective and computationally efficient way. Such data, while unstructured, contains rich structured (or relational) data, and today's search techniques do not tap into this potential. Toward such data-oriented search capabilities, this project developed techniques in several areas. We characterized query capabilities in terms of their input (keywords or entities) and output, and systematically developed a series of techniques:

1) Querying by Entities: Users have a clear target entity in mind and wish to collect relevant information about it. For example, many people have favorite celebrities, such as movie stars, and want to track their activities every day; business people want to collect user experiences with their products, which is critical for future quality improvement. In such scenarios, entities appear in the input, and users expect to find entity-related information, usually in the form of documents.

2) Querying for Entities: Users expect to retrieve particular entities in the returned results. For example, students applying to PhD programs would like to know the top universities in their fields of interest; a PhD student surveying a research topic may want to find the leading researchers in that area. In these scenarios, people are looking for particular entities (e.g., universities, professors) that satisfy their information needs.

3) Querying by and for Entities: Users specify entities in the input of their queries and expect entities in the output. For example, a customer who is dissatisfied with a newly bought iPad might search for the phone number of Apple's customer service to file a complaint. Here the input entity is "Apple's customer service," and the target entity is a phone number. Similar examples include finding the CEO of Amazon or the treatment for anxiety disorders. In these scenarios, people are interested in finding entities that stand in a particular relation to given entities.

4) Prototype System: Our research was driven by building the Content-Oriented Query System, which uses structured queries written in CQL (Content Query Language) to query an unstructured text corpus, encompassing the series of query capabilities we developed. Structured queries in CQL are translated into underlying textual queries that look for entities and relations over a corpus of text data, so that unstructured data can be queried with structured queries.

The publications, system demonstrations, and datasets can be found via the PI's homepage.
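As a rough illustration of translating a structured query into a textual one (items 3 and 4 above), the sketch below answers a "by and for entities" query by looking for tokens of the target entity type that occur near the input keywords. The report does not show CQL syntax, so the `EntityQuery` class, the `ENTITY_PATTERNS` table, the proximity window, and the scoring rule are all hypothetical stand-ins, and the phone numbers are fictional.

```python
import re
from dataclasses import dataclass


@dataclass
class EntityQuery:
    """Hypothetical structured query: keywords in, entities of one type out."""
    keywords: tuple   # lowercase words describing the input entity
    target_type: str  # name of the output entity type
    window: int = 5   # max token distance between a keyword and the entity


# Assumed recognizers per entity type; a real system would use an entity tagger.
ENTITY_PATTERNS = {
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def tokenize(text):
    """Split into lowercase tokens, keeping phone numbers as single tokens."""
    return re.findall(r"\d{3}-\d{3}-\d{4}|\w+", text.lower())


def run(query, docs):
    """Translate the query into a proximity search and rank matching entities."""
    pattern = ENTITY_PATTERNS[query.target_type]
    scores = {}
    for text in docs:
        tokens = tokenize(text)
        for pos, tok in enumerate(tokens):
            if not pattern.fullmatch(tok):
                continue
            # Tokens within `window` positions on either side of the entity.
            nearby = set(tokens[max(0, pos - query.window): pos + query.window + 1])
            matched = sum(1 for kw in query.keywords if kw in nearby)
            if matched:
                # Each occurrence adds the fraction of keywords found nearby.
                scores[tok] = scores.get(tok, 0.0) + matched / len(query.keywords)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Fictional pages and phone numbers, for illustration only.
PAGES = [
    "Unhappy with your iPad? Call Apple customer service at 800-555-0100.",
    "Apple customer service: 800-555-0100 (open daily).",
    "Pizza delivery 800-555-0199, great service.",
]

results = run(EntityQuery(("apple", "customer", "service"), "phone"), PAGES)
print(results[0][0])  # 800-555-0100
```

Summing evidence across pages means an entity that repeatedly co-occurs with all the keywords outranks one that appears once near a single keyword, which is the kind of aggregation an entity-returning query needs and a document-returning keyword search does not provide.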

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1018723
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $500,000
Indirect Cost:
Name: University of Illinois Urbana-Champaign
Department:
Type:
DUNS #:
City: Champaign
State: IL
Country: United States
Zip Code: 61820