This project is developing new techniques for identifying and managing files, replacing tree-structured file names with content- and metadata-based search access. By building on existing work in search and responding to the explosion in the volume of stored data, this project enables users to find and access their data in natural and intuitive ways, based on the files' contents, tags the user has assigned, system metadata, and provenance (information about the file's origins). This research targets high-end computing (HEC) users, who manage billions of files generated by measurement devices, experimentation, or scientific workflows. The techniques and system developed are also applicable to general-purpose computing.

Realizing this goal requires advances in several areas. First, the project is designing and developing fast, scalable mechanisms to gather, maintain and index the large volume of metadata and provenance that HEC applications and users generate. This project is also exploring search algorithms that operate on graph structures, enabling users to find files "near" their current workspace. To enable users to access this functionality, the project is developing a new "language" that facilitates the kind of searches that users need.
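
As a rough illustration of what finding files "near" a workspace could mean on a graph, the sketch below runs a breadth-first search over provenance links and returns every file within a given number of hops. The graph contents, the function name files_near, and the hop limit are assumptions made for this example; the project's actual algorithms are not described here.

```python
from collections import deque


def files_near(graph, start, max_hops):
    """Return {file: hop_count} for every file within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


# Hypothetical provenance links, stored in both directions.
graph = {
    "raw.dat": ["clean.dat"],
    "clean.dat": ["raw.dat", "plot.png", "stats.csv"],
    "plot.png": ["clean.dat"],
    "stats.csv": ["clean.dat", "report.pdf"],
    "report.pdf": ["stats.csv"],
}

nearby = files_near(graph, "raw.dat", 2)
```

Here report.pdf is three hops from raw.dat, so a two-hop search leaves it out.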

Project Report

Modern science is often done on high-performance computing systems, which generate petabytes of data across millions of files for just a single experiment. Finding previous results in this data is hard because, unlike Web pages, the files are often poorly labeled and consist of raw data rather than text. The goal of this project was to explore better ways to organize metadata (information about files) using provenance (the relationship of files to other files) and other information, such as labels that users supply to identify files or that programs generate by analyzing files' contents. We analyzed existing file systems to understand how users stored and named their files, and conducted interviews with scientists to understand how they wanted to track their data and how they actually did it. We found that many users could benefit from improved metadata management: for example, some users of multi-petabyte file systems kept notes on their files in Excel spreadsheets or even paper notebooks, risking data loss. We investigated techniques to search more metadata using less hardware by partitioning data using different criteria: who could access the data, how the files were generated (workflow), and the tags that made up the metadata itself. We found that partitioning data has great promise for improving metadata management because it reduces the amount of metadata that must be searched to find desired results. In effect, this is similar to keeping separate indexes for English and Chinese data at Google and skipping the Chinese indexes when someone is searching for an English document. Since provenance is an important factor in metadata systems, we developed techniques for storing provenance very efficiently by compressing the provenance graph that details which files are descended from which other files. This reduction in storage space allows us to leverage more provenance, resulting in more effective searches.
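
The effect of partitioning can be sketched in a few lines. The example below is a minimal illustration, not the project's implementation: it assumes one tag index per partition (keyed here by owner) and consults only the partitions the searcher is allowed to see. The class name PartitionedIndex, the file paths, and the tags are all hypothetical.

```python
from collections import defaultdict


class PartitionedIndex:
    """Toy metadata index: one tag -> files map per partition."""

    def __init__(self):
        # partition key -> tag -> set of file paths
        self.partitions = defaultdict(lambda: defaultdict(set))

    def add(self, partition, path, tags):
        for tag in tags:
            self.partitions[partition][tag].add(path)

    def search(self, tag, allowed_partitions):
        """Return (matching paths, number of partitions scanned)."""
        matches = set()
        scanned = 0
        for partition in allowed_partitions:
            if partition not in self.partitions:
                continue  # nothing indexed under this partition
            scanned += 1
            matches |= self.partitions[partition].get(tag, set())
        return matches, scanned


index = PartitionedIndex()
index.add("alice", "/a/run1.h5", ["climate", "2012"])
index.add("alice", "/a/run2.h5", ["climate"])
index.add("bob", "/b/sim.h5", ["climate"])
index.add("carol", "/c/seq.h5", ["genomics"])

# Only Alice's partition is consulted; Bob's and Carol's are skipped.
matches, scanned = index.search("climate", ["alice"])
```

In this toy setup the search touches one of three partitions, which is the same pruning idea as skipping indexes that cannot contain a relevant result.
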
We explored new techniques for naming files as well, since file names are the mechanism by which users interact with the file system. We developed TrueNames, a technique that creates standardized file names based on templates, allowing users to customize names based on file characteristics and to change name formats as needed. We also began investigating hierarchical namespaces for files, allowing for large flat namespaces that can be searched by "tag", in a way similar to Web searches. Unlike Web searches, however, our approach groups a relatively small number of files in each namespace (hundreds to a million or so), and allows namespaces to reference other namespaces. This approach limits the scope of searches, allowing users to find files that may be relevant to them but not to others. Lastly, we made a major advance in securing large-scale files with minimal added metadata overhead. Our approach protects terabyte-scale files used on clusters of thousands of computational nodes from compromise due to one or more "corrupted" nodes. In traditional file systems, a corrupted node can read an entire file. Under our approach, a corrupted node can only read the data it needs for its own local computation; this is typically under 1% of the entire file. This is done without increasing the amount of metadata that must be maintained for the file, and without increasing the load on a shared metadata server, facilitating the use of encryption to secure large files without much added cost.
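
To make the idea of template-based naming concrete, the sketch below renders one set of file metadata through two different name templates. It uses Python's standard string.Template as a stand-in; the template syntax, the metadata fields, and the function name render_name are assumptions for illustration and do not describe how TrueNames itself is implemented.

```python
from string import Template


def render_name(template, metadata):
    """Fill a name template from file metadata; raises KeyError if a field is missing."""
    return Template(template).substitute(metadata)


# Hypothetical metadata describing one output file.
meta = {
    "experiment": "turbulence",
    "run": "0042",
    "date": "2012-06-01",
    "ext": "h5",
}

# Two hypothetical name formats over the same metadata.
by_experiment = "${experiment}_run${run}_${date}.${ext}"
by_date = "${date}/${experiment}-${run}.${ext}"

name_a = render_name(by_experiment, meta)
name_b = render_name(by_date, meta)
```

Because the name is derived from metadata rather than typed by hand, switching formats only requires a different template, not renaming logic for each file.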

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 0937938
Program Officer: Almadena Y. Chtchelkanova
Project Start:
Project End:
Budget Start: 2009-10-01
Budget End: 2013-09-30
Support Year:
Fiscal Year: 2009
Total Cost: $553,000
Indirect Cost:
Name: University of California Santa Cruz
Department:
Type:
DUNS #:
City: Santa Cruz
State: CA
Country: United States
Zip Code: 95064