The proposal focuses on finding semantically equivalent files in an efficient and scalable manner. Files that are identical are candidates for optimizations that reduce storage costs, increase performance, and generally improve the system. Traditionally, two files are considered equivalent only if they are byte-for-byte identical, i.e., byte equivalent. However, this team's preliminary research shows that many other files are essentially equivalent even though the bytes they contain differ. The proposed work will investigate how to find such cases and perform optimizations that make use of semantic equivalence rather than byte equivalence.
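The distinction between byte equivalence and semantic equivalence can be illustrated with a minimal sketch. It is not the Facets design: the `canonical_form` function and the choice of transformations it undoes (gzip compression and Windows line endings) are assumptions made only for illustration, and a real system would need to recognize many more kinds of transformed files.

```python
import gzip
import hashlib


def byte_digest(data: bytes) -> str:
    """Digest of the raw bytes; equal digests indicate byte equivalence."""
    return hashlib.sha256(data).hexdigest()


def canonical_form(data: bytes) -> bytes:
    """Hypothetical normalization: undo gzip compression, unify line endings."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    return data.replace(b"\r\n", b"\n")


def semantic_digest(data: bytes) -> str:
    """Digest of the normalized content rather than of the raw bytes."""
    return byte_digest(canonical_form(data))


# Three copies of the same content in different on-disk representations.
original = b"quarterly report\nrevenue: 42\n"
windows_copy = original.replace(b"\n", b"\r\n")
compressed_copy = gzip.compress(original)

byte_equal = byte_digest(original) == byte_digest(compressed_copy)
semantic_equal = (
    semantic_digest(original)
    == semantic_digest(windows_copy)
    == semantic_digest(compressed_copy)
)
print(byte_equal, semantic_equal)  # prints: False True
```

In this sketch the three copies share no byte-level digest, yet all map to the same digest once normalized, which is the kind of relationship a byte-by-byte comparison would miss.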

This project will design and implement a framework, Facets, which explores new capabilities by applying optimizations to files that are essentially transformed versions of each other. Many optimizations and improvements can be applied to semantically equivalent files, including:

- Ensuring that the security of semantically equivalent files is preserved
- Easing backup and maintenance of semantically equivalent files in various formats, fidelities, and versions
- Using semantically equivalent files to improve performance, reliability, and availability
- Regenerating semantically equivalent files to speed up recovery and network transfer
- Selecting which semantically equivalent files to access according to performance or energy constraints

This team's preliminary research shows that only 5% of files on a typical user's machine are original content. The rest are copies of files from elsewhere or various derivatives of original content. Exploiting this observation is not trivial, but many significant improvements are possible if these relationships can be found and put to proper use. These improvements include enhanced security, more efficient backup and restoration, better file caching, more intelligent tradeoffs between performance and power use, and a host of other possibilities.

Broader Impacts: The code and techniques developed will be released in open source form. The team will take steps (such as applying for supplemental REU grants) to involve undergraduates in the research. They will give talks and recruit at Hispanic-serving institutions. Materials and concepts from the research will be incorporated into classes taught by the principal investigators at their institutions.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 1065127
Program Officer: Anita J. LaSalle
Budget Start: 2011-08-15
Budget End: 2016-07-31
Fiscal Year: 2010
Total Cost: $349,994
Name: University of California Los Angeles
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095