Commercial data providers have exploited open sources such as published telephone directories, property transactions, and bankruptcy filings for years. However, recent privacy concerns and legislation have caused many sources to redact the attributes that explicitly identify an entity, for example by removing the street number and name from an address or by giving only the last four digits of the Social Security number.
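
To make the effect of such redaction concrete, the following is a minimal sketch, not the project's method, of checking a partially redacted record against fully identified reference records. The record layouts, the field names, and the `is_compatible` and `candidates` helpers are all assumptions introduced for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReferenceRecord:
    """A fully identified record (hypothetical layout)."""
    name: str
    ssn: str    # nine digits, no separators
    city: str
    state: str


@dataclass
class RedactedRecord:
    """A record as it might appear in a partially redacted source (hypothetical layout)."""
    name: str
    ssn_last4: Optional[str] = None  # only the last four digits survive redaction
    city: Optional[str] = None       # street number and name have been removed
    state: Optional[str] = None


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def is_compatible(redacted: RedactedRecord, ref: ReferenceRecord) -> bool:
    """Return True if no attribute present in the redacted record contradicts the reference."""
    if normalize(redacted.name) != normalize(ref.name):
        return False
    if redacted.ssn_last4 is not None and not ref.ssn.endswith(redacted.ssn_last4):
        return False
    if redacted.city is not None and normalize(redacted.city) != normalize(ref.city):
        return False
    if redacted.state is not None and normalize(redacted.state) != normalize(ref.state):
        return False
    return True


def candidates(redacted: RedactedRecord, references: List[ReferenceRecord]) -> List[ReferenceRecord]:
    """Return every reference record that the redacted record could refer to."""
    return [ref for ref in references if is_compatible(redacted, ref)]


# Invented sample data; the identifiers are deliberately not real.
references = [
    ReferenceRecord("Mary Smith", "000001234", "Little Rock", "AR"),
    ReferenceRecord("Mary Smith", "000005678", "Little Rock", "AR"),
    ReferenceRecord("Mary Smith", "000011234", "Conway", "AR"),
]

with_clues = RedactedRecord("mary  smith", ssn_last4="1234", city="Little Rock", state="AR")
name_only = RedactedRecord("Mary Smith")

matches_with_clues = candidates(with_clues, references)
matches_name_only = candidates(name_only, references)
```

In this sketch the record that keeps the last four digits and the city narrows to a single candidate, while the name-only record stays compatible with all three reference records. Exact comparison is used only to keep the example short; a realistic matcher would need to tolerate name variants and missing or conflicting values.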

Adding to the challenge are semi-structured and unstructured sources such as narrative event notices, e.g., obituary, birth, marriage, divorce. These sources typically have complete names, but provide even fewer address clues, in some cases only naming the city of last residence or the city where the event occurred. These kinds of Partially Redacted Open Sources (PROS) are difficult to automate for several reasons:

- The documents do not explicitly state all of the attributes defining a unique identity
- The documents are often in semi-structured or unstructured text formats that complicate attribute extraction
- Knowledge management is difficult because each type of PROS has a different semantic ontology, often with many possible terms even though each instance is sparsely populated

Despite these challenges, PROS can be data rich, providing important supplementary entity information. For example, an obituary may give a complete set of family relationships to the decedent including parents, children, and siblings.

The primary aim of the proposed research is to investigate and develop effective methods and techniques for resolving the identity of entities appearing in PROS. The primary objective of the project is to improve and extend the methods and techniques developed in previous research for the specific case of identifying individuals in online obituary notices [1, 2] and for multi-agency entity resolution [3], and to demonstrate that these same methods and techniques can be effective when applied to other types of PROS.

The anticipated output of the research is a set of technical papers documenting in detail the methods and techniques developed in the project and assessments of their effectiveness in various PROS contexts. Where appropriate, software prototypes developed as part of the project will be included as part of the project deliverables.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0635655
Program Officer: Sylvia J. Spengler
Budget Start: 2006-08-15
Budget End: 2008-07-31
Fiscal Year: 2006
Total Cost: $235,000
Name: University of Arkansas Little Rock
City: Little Rock
State: AR
Country: United States
Zip Code: 72204