Affordable virtualized computing resources have enabled storage and processing of information at a scale never previously thought possible. Cloud computing is enabling what has been called the industrial revolution of data. Algorithms to store, process, compress, and protect this information at cloud scale are much needed, yet much of this infrastructure is currently developed without a good understanding of fundamental principles from information theory and signal processing.

The workshop "Communication Theory and Signal Processing in the Cloud Era" will take place on June 25 and 26 at the University of California, Berkeley. The meeting brings together internationally known experts from academia and industry to identify new challenges in cloud computing, coding techniques, and large-scale content distribution.

Project Report

This project made possible a workshop held in Berkeley, California, in 2012 on "Distributed Storage" that was attended by some of the world's most renowned researchers in the field. In brief, there are exciting challenges and opportunities for the use of coding in next-generation distributed storage systems. The workshop participants identified the following topics:

1) Erasure coding for distributed storage: Distributed storage systems such as Apache Hadoop are becoming the norm for big-data analytics and storage. Many such systems deploy clusters of thousands of servers and store tens of petabytes. At this scale, several disk and node failures per day are common. Typical systems such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS) rely on replication to provide reliable storage; for example, the default replication factor in HDFS is 3, meaning that three copies of each file are stored in different locations. Given the massive scales at which these clusters operate, even a small reduction in storage overhead considerably lowers the total storage used and thus the cost of storage. Recently, several cloud storage systems have started using erasure coding instead of replication to reduce this overhead (see the sketch after this list). Research on the use of codes for these systems is a promising future direction identified by the workshop participants.

2) Low latency to the cloud: In addition to providing reliability, another important task for any data center is to serve user requests as quickly as possible. For instance, a recent study shows that people visit a Web site less often if it is slower than a close competitor by just 250 milliseconds [nytimes_impatient]. While the use of codes for improving reliability in archival storage, where data is accessed infrequently (so-called "cold data"), is well understood, the role of codes in the storage of frequently accessed, active "hot data" is less clear and a promising research direction.

3) Image delivery and caching: Photos are first grouped into large archive files before being encoded, because current systems require files on the order of hundreds of megabytes for the coding overheads to diminish. Like all content, photos are separated into "hot" and "cold" categories depending on access statistics: hot data are replicated multiple times to ensure fast delivery, while cold data are protected by progressively larger codes that save storage. Optimizing delivery and caching algorithms for large image databases is going to be an important challenge.
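As a rough illustration of the storage-overhead argument in item 1 above, the following Python sketch compares 3-way replication with a hypothetical (14, 10) maximum-distance-separable (MDS) erasure code. The cluster size and code parameters are illustrative assumptions, not figures from the report.

```python
# Minimal sketch (assumed parameters, not values from the workshop report):
# compare raw storage required by 3-way replication versus an (n, k) MDS
# erasure code, where k data blocks are expanded into n coded blocks and
# any k of the n blocks suffice to recover the data.

def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per byte of user data under simple replication."""
    return float(replicas)

def erasure_code_overhead(n: int, k: int) -> float:
    """Raw bytes stored per byte of user data under an (n, k) MDS code."""
    return n / k

if __name__ == "__main__":
    user_data_pb = 10.0                     # assumed: 10 PB of user data
    rep = replication_overhead(3)           # HDFS default replication factor 3
    ec = erasure_code_overhead(14, 10)      # hypothetical (14, 10) code

    print(f"3-way replication:   {rep:.2f}x -> {user_data_pb * rep:.1f} PB raw")
    print(f"(14,10) erasure code: {ec:.2f}x -> {user_data_pb * ec:.1f} PB raw")
    # Fault tolerance per object: 3-way replication survives the loss of 2
    # copies, while the (14,10) MDS code survives the loss of any 4 blocks
    # per stripe, at well under half the storage overhead.
```

The point of the comparison is that erasure coding can improve both storage efficiency and fault tolerance relative to replication, which is why its use in systems like HDFS was singled out as a promising research direction.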

Project Start:
Project End:
Budget Start: 2012-06-15
Budget End: 2013-05-31
Support Year:
Fiscal Year: 2012
Total Cost: $39,279
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94710