There is a rich research literature on information integration (e.g., on data fusion, data integration, and data exchange, including schema matching, mapping, and composition), knowledge representation, ontologies, and semantic web technologies. However, there has been little prior work on the related problem of merging annotated datasets that already have largely compatible schemas, but where the data values of some fields can come from (or link to) different concept hierarchies (taxonomies).

Combining datasets into a single, consistent representation is a prerequisite for addressing many important scientific questions (e.g., those that require data expressed at broad spatial, temporal, and taxonomic scales). In practice, scientists combine multiple datasets manually, a time-intensive and error-prone process. In many application domains (e.g., biodiversity, ecology, systematics), data are often annotated with concepts from different but interrelated taxonomies. For instance, scientists who wish to combine datasets that record the presence or absence of species at given locations often find that the datasets draw species names from different taxonomies. In such cases, merging the datasets requires aligning the different taxonomies. However, even for aligned taxonomies (i.e., where formal articulation constraints are given), many different dataset merges are possible, including inconsistent or incomplete ones. These in turn can yield different or even contradictory outcomes in subsequent interpretation and downstream data analysis.

The primary goals of this project are to develop new techniques at the interface of data integration, knowledge representation, and reasoning, and to empower scientists by giving them new tools for merging and 'logically debugging' taxonomies and annotated datasets. The proposed Euler toolkit will include a formal framework with a broad range of constraints and data types; novel provenance-based techniques to detect, explain, and repair inconsistencies in taxonomy alignments; and new techniques to reduce uncertainty in alignments.
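
The claim that aligned taxonomies still admit many different merges can be illustrated with a small sketch. It is not the Euler toolkit's API. The species names, the `ARTICULATIONS` table, and the functions `candidate_targets` and `possible_merges` are hypothetical and introduced only for illustration. The sketch assumes two articulation relations, "equals" and "includes", where "includes" means that a concept in the first taxonomy was split into several narrower concepts in the second.

```python
from itertools import product

# Hypothetical articulations from concepts of taxonomy T1 to concepts of T2.
# "equals": same concept in both taxonomies.
# "includes": the T1 concept covers several narrower T2 concepts.
ARTICULATIONS = {
    "Aus bus": ("equals", ["Aus bus"]),
    "Cus dus": ("includes", ["Cus dus s.str.", "Cus eus"]),
}

def candidate_targets(name):
    """Return the non-empty sets of T2 concepts that a T1 presence record may denote."""
    relation, targets = ARTICULATIONS[name]
    if relation == "equals":
        return [frozenset(targets)]
    # For "includes", presence of the broad concept implies that at least one
    # narrower concept is present, but does not say which one.
    options = []
    for mask in product([False, True], repeat=len(targets)):
        chosen = frozenset(t for t, keep in zip(targets, mask) if keep)
        if chosen:
            options.append(chosen)
    return options

def possible_merges(t1_records):
    """Enumerate every T2-level reading of a list of (site, T1 name) presence records."""
    per_record = [
        [(site, choice) for choice in candidate_targets(name)]
        for site, name in t1_records
    ]
    merges = []
    for combo in product(*per_record):
        merged = set()
        for site, concepts in combo:
            merged.update((site, concept) for concept in concepts)
        merges.append(frozenset(merged))
    return merges

# Two made-up presence records expressed in taxonomy T1.
records = [("site1", "Aus bus"), ("site1", "Cus dus")]
merges = possible_merges(records)
for merge in merges:
    print(sorted(merge))
```

Even with a complete alignment, the single record for the split concept has three admissible readings in the second taxonomy, so a downstream analysis that counts species per site could report two or three species depending on which merge is chosen.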

For further information, see the project website: www.daks.ucdavis.edu/projects/euler

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1118088
Program Officer: James French
Budget Start: 2011-10-01
Budget End: 2016-09-30
Fiscal Year: 2011
Total Cost: $479,186
Name: University of California Davis
City: Davis
State: CA
Country: United States
Zip Code: 95618