This is the first-year funding of a three-year continuing award. Many important information systems applications require access to data stored in multiple autonomous databases. These databases often run on different hardware, software, and database management systems, and are distributed across various geographical locations. Heterogeneity among the databases in systems, semantics, and data distribution poses a challenge to conventional data retrieval techniques. In this project, key issues related to the processing of heterogeneous data are investigated. In particular, the work focuses on joining records that represent the same instance across disparate databases. A new operator, the Entity Join, has been defined. It uses the common attributes present in the data being joined to probabilistically estimate the correctness of the result. The model is extended to cover all types of data heterogeneity problems, including temporal heterogeneity. Additional operators are defined to enhance the data manipulation capability of the model, and their optimization is explored. This project also examines the use of rules in the process of matching objects. The different algorithms are first simulated to assess their performance, and prototypes are then built to further evaluate the optimization concepts. The results of this work will facilitate data analysis and exploration in many statistical and scientific applications, including census, surveys, epidemiology, biology, astronomy, and other areas.
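
The abstract does not give the formal definition of the Entity Join, so the following is only an illustrative sketch of the general idea: pair records from two sources and attach a confidence score derived from the attributes they share. The function names (`match_score`, `entity_join`), the attribute weights, the threshold, and the sample records are all hypothetical, and the weighted-agreement score is a crude stand-in for whatever probabilistic estimate the project actually uses.

```python
from itertools import product


def match_score(rec_a, rec_b, weights):
    """Weighted fraction of shared attributes on which two records agree.

    Hypothetical scoring rule: only attributes present in BOTH records
    (and listed in `weights`) are compared; the result lies in [0, 1].
    """
    total = 0.0
    agree = 0.0
    for attr, w in weights.items():
        if attr in rec_a and attr in rec_b:
            total += w
            if rec_a[attr] == rec_b[attr]:
                agree += w
    # With no common attributes there is no evidence of a match.
    return agree / total if total else 0.0


def entity_join(left, right, weights, threshold=0.5):
    """Pair records whose score meets the threshold, keeping the score."""
    result = []
    for a, b in product(left, right):
        score = match_score(a, b, weights)
        if score >= threshold:
            result.append((a, b, score))
    return result


# Made-up records from two autonomous sources with partly different schemas.
left = [{"name": "A. Smith", "city": "Berkeley", "birth_year": 1960, "dept": "CS"}]
right = [
    {"name": "A. Smith", "city": "Oakland", "birth_year": 1960, "salary": 50000},
    {"name": "B. Jones", "city": "Berkeley", "birth_year": 1971},
]
# Assumed weights: name is treated as more identifying than city or birth year.
weights = {"name": 2.0, "city": 1.0, "birth_year": 1.0}

pairs = entity_join(left, right, weights)
for a, b, score in pairs:
    print(a["name"], "<->", b["name"], "score:", score)
```

Each joined pair carries its score, so a downstream query can filter or rank results by how likely the two records are to describe the same instance instead of treating every match as certain.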

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 9116770
Program Officer: Y. T. Chien
Project Start:
Project End:
Budget Start: 1992-09-01
Budget End: 1996-08-31
Support Year:
Fiscal Year: 1991
Total Cost: $218,953
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94704