The significant progress made by database research on schema mapping (e.g., composition, invertibility), data exchange, and query rewriting can provide breakthrough solutions for the database schema evolution problem. As of today, however, information systems sorely lack the methods and tools needed to cope with the problem and to reduce the cost of data migration, query rework, application rewriting, and downtime created by schema changes. In fact, this old problem has been made worse by the success of scientific databases (e.g., Ensembl) and web information systems (e.g., Wikipedia), where the fast evolution of applications and requirements that characterizes the web and the scientific discovery process is exacerbated by the number and diversity of users and organizations cooperating on these endeavors. Fortunately, the openness of these public-domain information systems (vs. corporate ones) and the abundance of their interesting evolution histories make it possible to build a comprehensive testbed to determine the strengths, limitations, and potential of candidate methods and tools proposed for the problem.

Thus, this project is building: (i) an open-source curated repository containing evolution histories from key information systems, (ii) benchmarks for a comprehensive set of tools tested therein, and (iii) instruments to collect and analyze evolution histories. These are then used to (a) compare and evaluate existing approaches, methods, and tools, and (b) entice researchers to evaluate and improve their techniques and to add their test cases to the benchmark. A transformative impact can be expected on schema mapping research and applications, inasmuch as theoretical solutions can now be validated and improved on real-life case studies. These, in turn, are expected to significantly improve scientific databases and web information systems. For further information, see the project web page: www.cs.ucla.edu/~zaniolo/nsf0917333.html

Project Report

Modern information systems experience frequent revisions and changes in response to changes in their requirements and in the environment they support. These changes demand significant expenditures of time and resources; they are particularly serious for database schema changes, which require upgrading the database queries and revising the applications, a time-consuming and error-prone task involving experts and professionals. Schema-evolution problems have long occurred in traditional business-oriented databases, but they now occur even more frequently in databases designed to support scientific information systems and collaborative web systems, inasmuch as the pace of change in these modern information systems is accelerated by frequent scientific discoveries and by the number and diversity of participating users and organizations. Therefore, there is a growing need to support and automate (i) the process of upgrading schemas and the applications that depend on them, and (ii) the documentation of such upgrades and of the overall schema evolution history. The importance of the second objective is underscored by (a) the growing interest in documenting the provenance of data and metadata, and (b) the emergence of cloud computing, where the management of the IS schema is delegated to third parties.

The schema evolution problem has been recognized for a long time, but methods and tools have been slow to come about. A first obstacle has been the lack of general techniques for automating the rewriting of queries and updates after schema changes. Significant advances on this problem have recently been made by researchers working on schema mapping and database integration; however, these techniques have yet to be extended to the related problem of schema evolution, and few case studies exist that can be used by researchers to frame the problem. While in the past database administrators might have been wary of releasing such information about their corporate IS, the situation has improved dramatically thanks to web information systems, such as Wikipedia, and big-science projects, such as Ensembl, which make their schema evolution information available to the public. This made possible the realization of our research project, which has pursued and realized the following two objectives:

A. Building a rich, curated, and integrated testbed of schema evolution histories and tools wherewith (i) the various facets and technical challenges of the problem are exposed, and (ii) the effectiveness of current and future tools and techniques in addressing these problems can be stress-tested on complex real-life information systems.

B. Optimizing the documentation and management of schema histories and other metadata in our testbed to provide effective support for (i) queries about the history of past data and the provenance of current data, and (ii) queries on the history of the schema (e.g., for DB administrators and analysts who need to learn about past upgrades before implementing new ones).

To realize objective A, we first investigated a large number of public-domain information systems and then selected thirty such systems for inclusion in our testbed. Our selection includes popular web information systems, such as Wikipedia and XOOPS; medicine/biology databases, such as the Ensembl genetic DB and BioSQL; and CERN physics DBs, including GridCC and ATLAS. For each information system in our testbed, we extracted its schema history using various tools developed for this purpose. We then represented each schema evolution step using Schema Manipulation Operators (SMOs): our SQL2SMO system recognizes all changes in the schema history, along with the SMOs needed to transform each schema version into the following one.
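For illustration, here is a minimal sketch of what one SMO-encoded evolution step might look like. The operator names are modeled on those used in the schema evolution literature (e.g., DECOMPOSE TABLE, RENAME COLUMN), but the exact syntax and all table and column names below are hypothetical:

    -- Hypothetical evolution step: split a monolithic 'users' table
    -- into account data and profile data, then rename a column.
    -- (Illustrative SMO syntax; all names are invented for this sketch.)
    DECOMPOSE TABLE users INTO account(id, email, pass),
                               profile(id, realname, homepage);
    RENAME COLUMN pass IN account TO password_hash;

Encoding each evolution step this way is what enables automatic support for legacy queries and data: a query written against the old schema can, in principle, be rewritten against the new one. Again as a purely hypothetical sketch:

    -- Legacy query against the old schema:
    SELECT email, homepage FROM users WHERE id = 42;

    -- A possible rewriting against the evolved schema:
    SELECT a.email, p.homepage
    FROM   account a JOIN profile p ON a.id = p.id
    WHERE  a.id = 42;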
Our testbed is freely available from: http://yellowstone.cs.ucla.edu/schemaevolution/index.php/Benchmark_home

With respect to objective B, our project's main contribution was the design of a system called AM&PM (Archived Metadata & Provenance Manager). AM&PM extends the ANSI SQL INFORMATION_SCHEMA facility, which is already supported by most database vendors, with timestamp-based archival capabilities. AM&PM thus allows users to retrieve past versions of the schema, along with the SMOs used to convert past data and upgrade past queries. In our papers, we showed that the AM&PM functionality is critical for recounting not only the history of the database schema but also the provenance of the current data, whereby full auditability can be achieved in modern information systems.
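As a sketch of the kind of temporal catalog query that such timestamp-based archival is meant to support (the relation and column names below are our own invention for illustration, not AM&PM's actual catalog layout):

    -- Which columns did table 'users' have on 2010-06-01?
    -- (Hypothetical archived-catalog relations and columns.)
    SELECT column_name, data_type
    FROM   archived_info_schema.columns
    WHERE  table_name = 'users'
      AND  valid_from <= DATE '2010-06-01'
      AND  valid_to   >  DATE '2010-06-01';

    -- Which SMOs were applied to 'users' after that date?
    SELECT smo_text, applied_at
    FROM   archived_info_schema.schema_changes
    WHERE  table_name = 'users'
      AND  applied_at >  DATE '2010-06-01'
    ORDER BY applied_at;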

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0917333
Program Officer
Frank Olken
Project Start
Project End
Budget Start
2009-09-01
Budget End
2013-08-31
Support Year
Fiscal Year
2009
Total Cost
$475,453
Indirect Cost
Name
University of California Los Angeles
Department
Type
DUNS #
City
Los Angeles
State
CA
Country
United States
Zip Code
90095