Experience building and fielding data integration systems has shown that they are brittle in a very fundamental way: they cannot handle uncertainty about data or about how data is combined to provide answers. This limitation is especially pronounced in scientific applications, where data is inherently uncertain and the models of the domain are constantly evolving. From the users' perspective, the inability to model uncertainty can result in the loss of relevant answers, an explosion of irrelevant answers, and the absence of any justification for the answers returned. The limitation is deeply rooted in the deterministic paradigm underpinning data management systems today, which is designed to support scalability to large data instances but is incapable of representing and reasoning about uncertainty.
A new approach to data integration, where uncertainties are handled explicitly, is proposed. Over the past few years, the BioMediator system, which integrates about a dozen public data sources on genes and proteins, has been available. The group has observed and documented the types of uncertainty that limit the power of any mediator-based integration system like BioMediator. These uncertainties occur at three levels: the data instance level, the schema level, and the user query level. In the new approach, all uncertainties will be made explicit in the system and represented in a uniform way, using a probabilistic data model. The mediator system supports a query language based on SQL but with modified semantics: the answers to each query are annotated with a probability score and with lineage information.
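As a rough illustration of what "answers annotated with a probability score and lineage" could look like, the following minimal Python sketch combines uncertain source records into scored answers. It is not the BioMediator or U2 implementation: the `SourceTuple` class, the `answer_gene_to_protein` function, the source names, and the probabilities are all hypothetical, and the scoring assumes that source records are statistically independent, which the proposal does not state.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class SourceTuple:
    """One uncertain record from a data source (hypothetical layout)."""
    tuple_id: str       # made-up identifier, e.g. "sourceA:1"
    gene: str
    protein: str
    probability: float  # assumed confidence that this record is correct


def answer_gene_to_protein(tuples, gene):
    """Return {protein: (probability, lineage)} for the given gene.

    Assumption: records are independent, so an answer holds if at least
    one of its supporting records is correct.
    """
    supporting = {}
    for t in tuples:
        if t.gene == gene:
            supporting.setdefault(t.protein, []).append(t)

    answers = {}
    for protein, records in supporting.items():
        # P(at least one record correct) = 1 - P(all records wrong)
        score = 1.0 - prod(1.0 - r.probability for r in records)
        # Lineage: the source records that justify this answer
        lineage = [r.tuple_id for r in records]
        answers[protein] = (score, lineage)
    return answers


# Illustrative data only; these values are invented for the example.
records = [
    SourceTuple("sourceA:1", "G1", "P1", 0.8),
    SourceTuple("sourceB:7", "G1", "P1", 0.5),
    SourceTuple("sourceB:9", "G1", "P2", 0.3),
]
answers = answer_gene_to_protein(records, "G1")
for protein, (score, lineage) in answers.items():
    print(f"{protein}: probability={score:.2f}, lineage={lineage}")
```

In this sketch, an answer supported by two sources receives a higher score than either source alone would give it, and the lineage list tells the user which records justify the answer.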
The new work will involve the design of a probabilistic data model, the development of probabilistic query processing and optimization techniques, and the design of user feedback methods. The group will build a system, U2 (short for UII, Uncertain Information Integration), that will model uncertainty at all levels of the system, including the query language, mediated schema, source mappings, and source data. U2 will explain its results to the user and will actively seek to resolve uncertainty when it arises, incorporating feedback from the user where possible. The group will extend the BioMediator system and collaborate with its current users.
There are three areas of broader impact. First, issues of information integration will be incorporated more tightly into the undergraduate and graduate database curriculum. Second, the research will fuel collaboration with biomedical computing research and will extend the BioMediator system, which is currently in use by practitioners in the field. Finally, tools and services will be made available for public use.