Because the Government must administer complex tasks across a wide range of geographic scales, its data is partitioned in many different ways and collected at different times by different agencies. The resulting massive heterogeneity means that one cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. To date, all approaches to wrapping data collections, or even to creating mappings across comparable datasets, require manual effort. Despite some promising work, the automated creation of such mappings is still in its infancy, since equivalences and differences manifest themselves at all levels, from individual data values through metadata to the explanatory text surrounding the data collection as a whole. More general methods are required to address this problem effectively.
Viewing the data mapping problem as a variant of the cross-language mapping problem of Machine Translation (MT), this project will employ the statistical algorithms developed in the MT community since 1990 to discover correspondences across comparable datasets at all levels. In MT, these techniques align words and word sequences across languages. This research will adapt and extend them to consider not only data values (the analogue of words) but also data format/orthography, metadata information, and associated textual information (metadata descriptions, footnotes, etc.) in the alignment process, and to perform alignment learning at three levels: the individual data cell, the set of cells (column), and multiple columns. Multi-level alignment has not been attempted in MT before, and these learning techniques have never been applied to metadata schema integration or to database alignment or wrapping. If the automatically learned mappings are effective, the manual labor required for database wrapping should be significantly reduced.
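To make the MT analogy concrete, the following is a minimal sketch of cell-level alignment learning, assuming an IBM Model 1 style expectation-maximization procedure, which is one well-known statistical word-alignment technique and not necessarily the one this project will adopt. Each comparable record pair plays the role of a sentence pair, and each data value plays the role of a word. The function names (train_model1, best_match) and the pollutant codes and labels are hypothetical, chosen only for illustration. The sketch omits the NULL source token of the full model and does not cover format, metadata, or column-level evidence.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(target_value | source_value) by EM, IBM Model 1 style.

    pairs: list of (source_values, target_values), one per comparable record.
    Returns a dict keyed by (target_value, source_value).
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    # Uniform start over every source/target value pair that co-occurs.
    t = {(w, s): 1.0 / len(target_vocab)
         for src, tgt in pairs for w in tgt for s in src}

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-alignment counts
        total = defaultdict(float)  # expected counts per source value
        # E-step: share each target value among the source values in its record.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize expected counts into probabilities.
        t = {(w, s): c / total[s] for (w, s), c in count.items()}
    return t


def best_match(t, source_value):
    """Return the target value with the highest probability for source_value."""
    candidates = {w: p for (w, s), p in t.items() if s == source_value}
    return max(candidates, key=candidates.get)


# Hypothetical comparable records: one dataset uses pollutant codes,
# the other uses spelled-out labels. No mapping between them is given.
pairs = [
    (["O3", "PM10"], ["ozone", "particulate"]),
    (["O3", "CO"], ["ozone", "carbon_monoxide"]),
    (["PM10", "CO"], ["particulate", "carbon_monoxide"]),
]
t = train_model1(pairs)
matches = {code: best_match(t, code) for code in ("O3", "PM10", "CO")}
```

In this toy setting, co-occurrence statistics alone are enough for the procedure to pair each code with its label. Real datasets would need the additional evidence sources described above, such as format and metadata, to resolve values that co-occur less cleanly.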
Two sets of domain data will be used. Air quality data will be provided by EPA staff at the California Air Resources Board in Sacramento, who periodically integrate data from some 35 regional Air Quality Management Districts throughout California into a single California-wide database and pass it on to the Federal EPA in North Carolina. Fire emissions data will be provided by a different set of EPA offices, the USDA/Forest Service, and the Department of the Interior.