The growing volume of textual data in distributed sources, combined with the obstacles to creating and maintaining central repositories, motivates the need for effective distributed information extraction and mining techniques. Different kinds of records on a given individual may exist in different databases, a type of data fragmentation. Even with standards, however, automatically integrating schemas remains an open research issue. A related issue is that current Association Rule Mining (ARM) algorithms for distributed data can mine that data (whether vertically or horizontally fragmented) only when the global schema across all databases is known. For information extracted from distributed textual data, no preexisting global schema is available, because the entities extracted vary between documents: new input text can contain previously unseen entities. As a result, a fixed global schema cannot be assumed and existing algorithms cannot be employed.
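
To make the schema problem concrete, the following is a minimal sketch rather than anything taken from the project. The site names, the `case_id` key, and the entity types are all hypothetical. It shows how the set of attributes implied by extracted entities grows when a new document arrives, which is why a fixed global schema cannot be declared in advance.

```python
# Hypothetical extracted records at two sites; the key name and
# entity types are illustrative assumptions, not project data.
site_a = [
    {"case_id": "C1", "weapon": "knife", "location": "alley"},
    {"case_id": "C2", "vehicle": "van"},
]
site_b = [
    {"case_id": "C1", "suspect_height": "tall"},
]


def observed_schema(records):
    """Return the set of attribute names seen so far in a list of records."""
    attrs = set()
    for rec in records:
        attrs.update(rec.keys())
    return attrs


# The "global schema" one could infer from the data seen so far.
schema_before = observed_schema(site_a) | observed_schema(site_b)

# A new document arrives containing a previously unseen entity type.
site_b.append({"case_id": "C3", "tattoo": "anchor"})
schema_after = observed_schema(site_a) | observed_schema(site_b)

# Attributes that any schema fixed beforehand would have missed.
new_attributes = schema_after - schema_before
print(sorted(new_attributes))
```

Any algorithm that requires the full attribute list before mining begins would have to be rerun or reconfigured each time `new_attributes` is non-empty.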

This effort describes a distributed higher-order text mining framework that requires neither knowledge of the global schema nor schema integration as a precursor to mining rules. The framework, termed D-HOTM, extracts entities and discovers rules based on higher-order associations between entities in records linked by a common key. Entity extraction relies on information extraction rules learned with a previously developed semi-supervised active learning algorithm. The learned rules are applied to automatically extract entities from textual data that describe, for example, criminal modus operandi. The extracted entities are stored in local relational databases, which are mined using the D-HOTM distributed association rule mining algorithm.
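
The idea of linking records by a common key before looking for associations can be illustrated with a small sketch. This is not the D-HOTM algorithm itself, whose details are not given here. It is a simplified illustration under assumed data: the two local databases, the keys `K1` and `K2`, and the entity values are hypothetical, and the helper names `link_by_key` and `pair_support` are introduced only for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical local databases: each row is (linking key, extracted entity).
local_db_1 = [("K1", "knife"), ("K1", "alley"), ("K2", "van")]
local_db_2 = [("K1", "mask"), ("K2", "alley"), ("K2", "mask")]


def link_by_key(*databases):
    """Group entities from all databases under their shared linking key."""
    linked = defaultdict(set)
    for db in databases:
        for key, entity in db:
            linked[key].add(entity)
    return linked


def pair_support(linked):
    """Count, for each entity pair, the number of keys under which both appear."""
    counts = defaultdict(int)
    for entities in linked.values():
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return dict(counts)


linked = link_by_key(local_db_1, local_db_2)
support = pair_support(linked)

# Pairs such as ("knife", "mask") never co-occur inside a single database;
# they become visible only after records are linked across databases.
for pair, count in sorted(support.items()):
    print(pair, count)
```

No column list is declared anywhere in this sketch: entities are treated as items attached to a key, so a previously unseen entity simply becomes a new item rather than requiring a schema change.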

The broader impacts of the work lie in the collaboration with local law enforcement and healthcare providers to deploy live test beds that enable problem solving through the mining of reports and the identification of physician best practices. Pre-college internships are provided for students, as well as support for graduate students.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0534276
Program Officer: Lawrence Brandt
Project Start:
Project End:
Budget Start: 2006-01-01
Budget End: 2007-01-31
Support Year:
Fiscal Year: 2005
Total Cost: $288,141
Indirect Cost:
Name: Lehigh University
Department:
Type:
DUNS #:
City: Bethlehem
State: PA
Country: United States
Zip Code: 18015