Scalable Data Management Using Metadata and Provenance

Seltzer, Margo

Abstract

This project is developing new techniques for identifying and managing files, replacing tree-structured file names with content- and metadata-based search access. By leveraging existing work in search and recognizing the explosion in the volume of data stored, this project enables users to find and access their data in natural and intuitive ways, based on the files' contents, tags the user has assigned, system metadata, and provenance (information about the file's origins). This research targets high-end computing (HEC) users, who manage billions of files generated by measurement devices, experimentation, or scientific workflows. The techniques and system developed are also applicable to general-purpose computing.

Realizing this goal requires advances in several areas. First, the project is designing and developing fast, scalable mechanisms to gather, maintain and index the large volume of metadata and provenance that HEC applications and users generate. This project is also exploring search algorithms that operate on graph structures, enabling users to find files "near" their current workspace. To enable users to access this functionality, the project is developing a new "language" that facilitates the kind of searches that users need.

Project Report

This project addressed the problem of managing data collections containing billions of files and petabytes of data. Analysis of scientific workloads revealed that in such collections, file names frequently contain important metadata: data describing how the file was produced. When users access such objects, they frequently search through large collections, selecting objects with particular values for these metadata. By developing computational systems that automatically generate and maintain provenance, the description of how data are derived, it is possible to provide users with better search and navigation tools. Workload analysis also revealed that data in scientific data sets pass through two distinct phases: creation, during which reading is unnecessary, and use, during which updates are unnecessary. Conventional file system interfaces and implementations, which are designed around concurrent reading and writing, are therefore not well matched to this domain. Instead, a prototype implementation exposes an API that separates these two phases, allowing for improved storage and retrieval efficiency. The project team developed several other prototype systems, each of which transparently collects provenance. For example, the Provenance-Aware Storage System (PASS) collects provenance in file systems, the electronic lab notebook (Burrito) collects provenance from the graphical user interface as well, and any program can capture provenance by integrating the Core Provenance Library into the application. In each case, the system maintains a simple representation of the provenance and allows for the creation of references between the systems, so a user could, for example, trace an object's history from the arrival of a data file in the user's email inbox, through the execution of a program that used that data file, to the graphing software that produced an image of a computed result.
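The separation of creation and use phases described above can be illustrated with a small sketch. This is not the project's prototype API, which the report does not specify; the names TwoPhaseObject, PhaseError, append, seal, and read are hypothetical and chosen only to show the idea that an object accepts writes but no reads while it is being created, and reads but no updates once it is in use.

```python
class PhaseError(RuntimeError):
    """Raised when an operation is not allowed in the object's current phase."""


class TwoPhaseObject:
    """Hypothetical write-once object: append during creation, read after seal()."""

    def __init__(self, name):
        self.name = name
        self._chunks = []   # writes buffered during the creation phase
        self._data = None   # immutable bytes, set when the object is sealed

    @property
    def sealed(self):
        return self._data is not None

    def append(self, chunk):
        """Creation phase only: add data to the end of the object."""
        if self.sealed:
            raise PhaseError("object is in the use phase; updates are not allowed")
        self._chunks.append(bytes(chunk))

    def seal(self):
        """End the creation phase and enter the read-only use phase."""
        if self.sealed:
            raise PhaseError("object is already sealed")
        self._data = b"".join(self._chunks)
        self._chunks = []

    def read(self, offset=0, length=None):
        """Use phase only: return bytes starting at offset."""
        if not self.sealed:
            raise PhaseError("object is still in the creation phase; reads are not allowed")
        end = len(self._data) if length is None else offset + length
        return self._data[offset:end]


# Usage: an instrument writes a file once, then analyses only read it.
obj = TwoPhaseObject("run-001.dat")
obj.append(b"abc")
obj.append(b"def")
obj.seal()
payload = obj.read()
```

Because neither phase has to support the other's operations, an implementation along these lines is free to buffer and lay out data for sequential writing first and for retrieval afterwards, which is one plausible source of the efficiency gain the report mentions.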
By indexing this provenance data, systems then allow users to find files by issuing queries. For example, "Find me the email message that had the data I used to generate this image," or "Where is the output I created based on the data in this message?" Provenance not only documents the history of an object, providing new ways to access data, but also exposes valuable information about the object itself. The project team developed techniques for mining provenance to extract attributes of a file, such as whether the file contains program source code or shared data definitions. Provenance data is inherently graph-structured; each object has a set of objects on which it depends and potentially another set of objects that depend upon it. These dependencies, which form the edges between vertices in a graph, may represent different relationships. For example, in a scientific environment, one edge might indicate that a file provided data as input to a program, while another edge might indicate that one object is a copy of another object. In social networking, some edges indicate friendship; others indicate membership in a group. Managing graph-structured data efficiently and managing it securely are two key challenges also addressed in this work. The SybilSafety approach, developed to protect information in graphs, accepts a security policy expressed as a set of constraints on what information can be made public and what information should be hidden. It then uses those constraints to produce a collection of graphs, each of which reveals the public information from the original graph, but which together provide sufficient ambiguity to hide the data that should not be revealed. This approach allows for audits by a third party, confirming that the released data satisfy the security constraints. This is the first general and practical solution to provably guaranteed privacy enforcement on graph-structured data.
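The provenance queries quoted above, such as finding the email message behind an image, amount to traversing dependency edges in a graph. The following is a minimal sketch under stated assumptions: the ProvenanceGraph class, its method names, the object kinds, and the relation labels are all hypothetical and are not taken from PASS, Burrito, or the Core Provenance Library.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Hypothetical in-memory provenance graph with typed dependency edges."""

    def __init__(self):
        self._parents = defaultdict(list)  # object -> [(relation, object it depends on)]
        self._kinds = {}                   # object -> kind, e.g. "file" or "email"

    def add_object(self, name, kind):
        self._kinds[name] = kind

    def add_dependency(self, child, parent, relation):
        """Record that `child` depends on `parent` through `relation`."""
        self._parents[child].append((relation, parent))

    def ancestors(self, start, kind=None):
        """Breadth-first walk over everything `start` was derived from.

        If `kind` is given, only ancestors of that kind are returned.
        """
        seen = {start}
        queue = deque([start])
        found = []
        while queue:
            node = queue.popleft()
            for _relation, parent in self._parents.get(node, []):
                if parent in seen:
                    continue
                seen.add(parent)
                queue.append(parent)
                if kind is None or self._kinds.get(parent) == kind:
                    found.append(parent)
        return found


# Usage: email attachment -> analysis program -> results -> plotted image.
g = ProvenanceGraph()
g.add_object("msg-42", "email")
g.add_object("data.csv", "file")
g.add_object("analyze run", "process")
g.add_object("results.txt", "file")
g.add_object("plot run", "process")
g.add_object("image.png", "file")
g.add_dependency("data.csv", "msg-42", "extracted_from")
g.add_dependency("analyze run", "data.csv", "input")
g.add_dependency("results.txt", "analyze run", "generated_by")
g.add_dependency("plot run", "results.txt", "input")
g.add_dependency("image.png", "plot run", "generated_by")

source_emails = g.ancestors("image.png", kind="email")
```

The second example query, finding outputs derived from a given message, is the same walk in the opposite direction and would need a reverse index from each object to its dependents.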
Graph-structured data is not well served by conventional data management approaches, such as relational databases. Although first-generation graph databases are available, they do not perform well on datasets that exceed the size of memory. The team developed graphdb-bench, a software benchmarking suite and methodology that allows for detailed analysis of the behavior of a graph database. The benchmark provides predictive models that, when compared against actual performance, reveal key aspects of a database's implementation. Such insights are used in two ways: they suggest how to select appropriate storage technology, and they inform the design of next-generation graph database systems. The collection of tools and results makes fundamental research contributions in computer science and also provides enabling technologies for other domains. The PIs worked closely with researchers in high-performance physics, astronomy, and systems biology to ensure that the problems they tackled and the solutions they designed and developed were suitable for computational science environments.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 0937914
Program Officer: Almadena Y. Chtchelkanova

Project Start
Project End
Budget Start: 2009-10-01
Budget End: 2012-09-30
Support Year
Fiscal Year: 2009
Total Cost: $351,643
Indirect Cost

Institution

Harvard University, Cambridge, MA, United States
