This Small Business Innovation Research (SBIR) Phase I project addresses a problem facing enterprises today: linking their disparate structured databases with unstructured text documents such as articles, manuals, reports, emails, blogs, and folksonomies. There is no easy way to perform a federated search, let alone to enable more intelligent applications over such diverse data sources, without experts spending considerable time and effort customizing systems and data models. With the recent emergence of commercial-grade Resource Description Framework (RDF) triple stores, it becomes possible to merge massive amounts of structured and unstructured data by defining a common ontology model for the DBMS schemas and representing the structured content as semantic triples. Lymba proposes novel methods to transform unstructured data sources inside corporate firewalls into a consolidated RDF store, merge it with other ontologies and structured data, and offer a natural language question answering (QA) interface for easy use. To make the QA robust, an innovative hybrid approach is proposed that draws answers from the RDF store as well as directly from indexed text documents.

The potential impact of delivering a question answering system that operates on a commercial-grade RDF store is significant: it fills a need for users of such a store to access more information easily and to implement intelligent applications quickly, using natural language questions as the main vehicle. The project also produces enabling technology to advance the semantic web. If successfully deployed, the proposed research has the potential to translate into a viable commercial product with significant revenues.

Project Report

Enterprises face the problem of linking disparate structured databases with unstructured text documents such as knowledge articles, product literature, call center data, social media, and engineering notes. There is no easy way to perform a single federated search, let alone to enable sophisticated applications over diverse data sources, without considerable time and effort. With the recent emergence of commercial-grade Resource Description Framework (RDF) triple stores, it became possible to merge, with minimal effort, massive amounts of structured and unstructured data under a common ontology model for seamless access to heterogeneous data. The "Hybrid Question Answering Combining a Search Index with an RDF Store" (HQA) project developed a hybrid search prototype that explored: 1) efficient and accurate algorithms and tools to transform unstructured document content into a rich and complete semantic representation that is compatible with the RDF standard; and 2) methods of information access that enable intelligent applications and hide the underlying complexity of the voluminous semantic data being searched. Specifically, Lymba’s HQA project researched and developed innovative tools and algorithms to:
1. Transform text into RDF triples for seamless integration of text resources with structured data that is already available in databases;
2. Push inferences closer to the data to draw implicit relationships and make data smarter;
3. Provide a natural language interface to the rich semantic information in a scalable RDF triple store; and
4. Combine the information in the triple store with a free text search index, guaranteeing robustness of the QA application.
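
The combination of a triple store with a free text search index can be illustrated with a minimal sketch. It is not Lymba's implementation: the toy triples, the documents, the function names, and the "structured answer first, text search as fallback" strategy are all assumptions made for illustration, and a real system would use an actual RDF store and search engine rather than in-memory Python collections.

```python
import re

# Hypothetical toy data: (subject, predicate, object) triples standing in for an RDF store.
TRIPLES = {
    ("WidgetPro", "hasWarrantyMonths", "24"),
    ("WidgetPro", "manufacturedBy", "Acme"),
}

# Hypothetical free text documents standing in for a search index.
DOCUMENTS = {
    "kb-001": "Reset the WidgetPro by holding the power button for ten seconds.",
    "kb-002": "Acme call center hours are 9am to 5pm on weekdays.",
}


def query_triples(subject, predicate):
    """Return all objects matching (subject, predicate, ?) in the toy triple set."""
    return sorted(o for s, p, o in TRIPLES if s == subject and p == predicate)


def search_text(keywords):
    """Rank documents by the number of distinct keywords they contain."""
    wanted = {k.lower() for k in keywords}
    scored = []
    for doc_id, text in DOCUMENTS.items():
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        hits = len(wanted & tokens)
        if hits:
            scored.append((hits, doc_id))
    # Most hits first; ties broken by document id for a stable order.
    return [doc_id for hits, doc_id in sorted(scored, key=lambda x: (-x[0], x[1]))]


def hybrid_answer(subject, predicate, keywords):
    """Prefer a structured answer; fall back to text search when none exists."""
    structured = query_triples(subject, predicate)
    if structured:
        return ("rdf", structured)
    return ("text", search_text(keywords))
```

A question covered by the structured data is answered from the triples, while a question the triples cannot answer still returns candidate documents from the text side, which is the robustness argument made in item 4.
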
Lymba’s research efforts during the 6-month performance period of the Phase I project resulted in algorithms for scalable transformation of free text into semantic RDF triples, knowledge-driven inference capabilities, and an innovative natural language interface that hides the complexities of the programmatic language interface to the RDF store. These algorithms were individually evaluated for performance and then combined into a prototype hybrid question answering system that is the first to provide a natural language interface to a smart, semantically rich RDF data store. In our evaluations, the HQA system achieved a 19.31% improvement in mean reciprocal rank over a question answering system backed only by a regular free text search index. As a scientific impact, the prototype shows that question answering technology benefits significantly from the fusion of deep semantic information from heterogeneous data sources. The broader impact of the HQA project is the development of enabling technologies that let knowledge workers access information easily and quickly through a natural language search interface. This effort demonstrates the feasibility of heterogeneous question answering and will be extended in Phase II to operate in an international commercial environment serving the CRM market.
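
For readers unfamiliar with the metric, mean reciprocal rank (MRR) averages, over all test questions, the reciprocal of the rank at which the first correct answer appears, counting zero when no correct answer is returned. The sketch below shows the standard computation. The rank lists are invented for illustration and are not the project's evaluation data, and the report does not say whether its 19.31% figure is a relative or an absolute improvement, so the relative form shown here is an assumption.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Compute MRR from the 1-based rank of the first correct answer per question.

    Use None for a question where no correct answer was returned.
    """
    if not first_correct_ranks:
        raise ValueError("at least one question is required")
    total = sum(1.0 / rank if rank else 0.0 for rank in first_correct_ranks)
    return total / len(first_correct_ranks)


def relative_improvement(baseline, improved):
    """Relative change of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline


# Illustrative ranks only (hypothetical, not from the HQA evaluation).
baseline_ranks = [1, 2, None, 4]   # text-index-only system
hybrid_ranks = [1, 1, 2, 4]        # hybrid system

baseline_mrr = mean_reciprocal_rank(baseline_ranks)
hybrid_mrr = mean_reciprocal_rank(hybrid_ranks)
gain = relative_improvement(baseline_mrr, hybrid_mrr)
```
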

Agency: National Science Foundation (NSF)
Institute: Division of Industrial Innovation and Partnerships (IIP)
Type: Standard Grant (Standard)
Application #: 1113285
Program Officer: Muralidharan Nair
Budget Start: 2011-07-01
Budget End: 2012-06-30
Fiscal Year: 2011
Total Cost: $179,760
Name: Lymba Corporation
City: Richardson
State: TX
Country: United States
Zip Code: 75080