This Small Business Innovation Research (SBIR) Phase I project addresses the problem of integrating information about named entities, such as people, companies, and products, from numerous data sources. Integrating information about entities from multiple sources can be difficult because sources may use different formats and terminology to describe the same entity, a problem referred to as "entity resolution". Most existing commercial enterprise systems rely on rule-based matching techniques for entity resolution. This project investigates statistical learning techniques that allow a system to estimate the probability of a match, rather than computing a score based on ad-hoc rules or weights. Because the approach is based on sound statistical principles and uses evidence compiled from large datasets, it can produce more accurate results than existing methods. Moreover, these advantages are amplified when handling data that has highly variable, missing, or noisy attributes, such as data extracted from Web sites.

The broader impact/commercial potential of this project lies in enabling enterprises to perform more accurate and reliable data integration. There are many potential target markets that need better technology for integrating information about businesses, products, people, locations, and other entities. This capability is critical for some of the nation's largest companies and institutions, from search engines, to the U.S. intelligence and law enforcement community, to financial institutions. In particular, large enterprises often have difficulty utilizing data extracted from news, foreign language data sources, and social media, because the extracted data is noisy and not well structured. The technology developed in this project will help enterprises make use of the growing amount of information on the Web, so that they can take advantage of the network of relationships that link people, companies, and other entities to serve their customers better.

Project Report

The growth of the Internet has made it much easier to aggregate information about named entities, such as companies and products, from numerous data sources. However, integrating information from multiple sources can be difficult due to the use of different formats and terminology to refer to the same entity. For example, one source may refer to "St. John’s Hospital in LA" and another might refer to the same entity as "Saint John’s in Santa Monica". This matching problem, often referred to as entity resolution, arises in a variety of commercial applications. Current enterprise systems typically employ ad-hoc rules that are manually configured by trial and error. However, these systems tend to perform poorly on data extracted from heterogeneous sources, such as text documents. In this project, we built on recent progress in statistical machine learning algorithms to address challenging entity resolution applications beyond the capabilities of present systems. In particular, our work investigated a statistical learning approach that allows a system to estimate the probability of a match, rather than computing a score based on ad-hoc rules. Because the approach is based on sound statistical principles and uses evidence compiled from large datasets, it can produce more accurate results than rule-based approaches. Moreover, these advantages are amplified when handling data that has highly variable, missing, or noisy attributes. Whereas rule-based approaches can be brittle due to difficulties inherent in combining ad-hoc scores for different attributes, statistical inference methods can deal with such problems gracefully. There were two primary focuses in our Phase I project. The first was to investigate extensions to our statistical model necessary for handling diverse types of real-world data.
In particular, we addressed several challenging issues, including 1) integrating data extracted by multiple natural language text extractors, each of which may be noisy in different respects, 2) handling relational data, that is, data about the relations between two or more entities, rather than simple attributes, and 3) resolving entities in multiple languages, which can involve a combination of transliteration and translation. We designed and evaluated extensions to our existing statistical model to handle these challenges. We also investigated a joint inference approach that we can use to implement the extensions in Phase II. In addition to designing extensions to our existing model, we collected and analyzed commercially relevant datasets to help us prioritize our design work, and to estimate the performance improvements we can expect from our technical enhancements in different domains. This work establishes the foundation for our Phase II project, where we have proposed implementing, refining, and testing the designs that we developed in Phase I.
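
The contrast drawn above between an ad-hoc rule score and an estimated match probability can be illustrated with a minimal sketch. It is not the project's model, which the report does not specify. It assumes a simple naive Bayes (Fellegi-Sunter style) combination of per-attribute agreement evidence, with attributes treated as conditionally independent. The attribute names, the m/u probabilities, and the prior are hypothetical values chosen only for illustration; in practice they would be estimated from data.

```python
import math

# Hypothetical per-attribute probabilities (illustrative, not from the project):
#   m = P(attribute agrees | the two records describe the same entity)
#   u = P(attribute agrees | the two records describe different entities)
M_U = {
    "name": (0.90, 0.05),
    "city": (0.80, 0.20),
    "phone": (0.95, 0.01),
}

# Assumed prior probability that a randomly chosen record pair is a match.
PRIOR_MATCH = 0.01


def match_probability(agreement, m_u=M_U, prior=PRIOR_MATCH):
    """Estimate P(match) for one record pair.

    `agreement` maps attribute name -> True (values agree),
    False (values disagree), or None (value missing in either record).
    """
    # Start from the prior, expressed as log-odds.
    log_odds = math.log(prior / (1.0 - prior))
    for attr, agrees in agreement.items():
        if agrees is None:
            # A missing attribute contributes no evidence either way.
            continue
        m, u = m_u[attr]
        if agrees:
            log_odds += math.log(m / u)
        else:
            log_odds += math.log((1.0 - m) / (1.0 - u))
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Name and phone agree; city is missing from one source.
    print(match_probability({"name": True, "city": None, "phone": True}))
    # Every attribute disagrees.
    print(match_probability({"name": False, "city": False, "phone": False}))
```

Under these assumptions a missing attribute simply drops out of the sum, so the estimate falls back toward the prior instead of being penalized by a fixed rule weight. This is one way to read the report's claim that statistical inference handles missing or noisy attributes gracefully.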

Agency: National Science Foundation (NSF)
Institute: Division of Industrial Innovation and Partnerships (IIP)
Type: Standard Grant (Standard)
Application #: 1143373
Program Officer: Juan E. Figueroa
Project Start:
Project End:
Budget Start: 2012-01-01
Budget End: 2012-12-31
Support Year:
Fiscal Year: 2011
Total Cost: $179,992
Indirect Cost:
Name: Inferlink Corporation
Department:
Type:
DUNS #:
City: El Segundo
State: CA
Country: United States
Zip Code: 90245