One dominant characteristic of today's large-scale computing systems is the prevalence of large storage clusters. Storage clusters at the scale of hundreds or thousands of commodity machines are increasingly being deployed. At companies such as Amazon, Google, and Yahoo, thousands of nodes are managed as a single system.

While large clusters bring many benefits, they also bring a new challenge: a growing number and frequency of failures that must be managed. Bits, sectors, disks, machines, racks, and many other components fail. With millions of servers and hundreds of data centers, there are millions of opportunities for these components to fail. Failure to deal with these faults directly impacts the reliability and availability of data and jobs.

Unfortunately, we still hear data-loss stories. For example, in March 2009, Facebook lost millions of photos due to simultaneous disk failures that "should" rarely happen at the same time (but did); in July 2009, a large bank was fined a record 3 million pounds after losing data on thousands of its customers; and in October 2009, T-Mobile Sidekick, which relies on Microsoft's cloud service, lost its customers' data. These incidents show that existing large-scale storage systems remain fragile in the face of failures.

To address the challenges of large-scale recovery, the goals of this project are to: (1) identify the fundamental problems of recovery in today's scalable world of computing, (2) improve the reliability, performance, and scalability of existing large-scale recovery, and (3) explore formally grounded languages that enable rigorous specification of recovery properties and behaviors. Our vision is to build systems that "DARE to fail": systems that deliberately fail themselves, exercise recovery routinely, and enable easy and correct deployment of new recovery policies.

For more information, please visit this website: http://boom.cs.berkeley.edu/dare/

Project Report

Large-scale computing and data storage systems in the cloud are an important platform for much of our society's infrastructure. A critical factor in the availability, reliability, and performance of cloud services is how they react to failure. While many cloud services can handle a single "fail-stop" fault, how they react to other emerging types of faults is less well understood. These faults include cases in which multiple components fail simultaneously, a component misbehaves and fails silently, or a component exhibits performance many times slower than usual. Thus, the goals of the DARE project are to understand how existing cloud-based services react to these three types of faults and to develop services that are more robust and scalable.

We have analyzed a wide range of existing, widely used cloud services (e.g., Hadoop, HDFS, ZooKeeper, Cassandra, HBase, and Dropbox) to understand how they react to these three types of failures. By systematically searching through the extremely large space of failures, we have identified numerous implementation and design flaws. For example, even in the relatively well-understood implementation of speculative execution of map-reduce jobs in Hadoop, we uncovered three problems that can lead to a collapse of the entire cluster. First, if all tasks in a map-reduce job are slow, then no single straggler can be identified. Second, imprecise accounting can cause the slowdown from a slow map node to be blamed instead on an otherwise normal reducer node. Finally, a backup task can be improperly restarted on the same problematic components due to memoryless retry (a sketch of this issue appears below).

To fix these types of general problems, we have devised, implemented, and evaluated a number of novel solutions. First, we developed Fracture, a framework that enables an existing, complex application to be divided into individual mini-processes; each mini-process can be isolated from the others, run in its own environment, and sampled, restarted, or replicated on its own. Second, we proposed Warped Mirrors, an approach for robust RAID storage systems built with flash-based storage devices; Warped Mirrors increase the likelihood that flash devices will not wear out at similar times before a faulty device can be replaced, thus preserving the independence of device failures within a RAID array. Third, we implemented ViewBox, a robust version of a cloud-based file-synchronization service; most importantly, ViewBox will not forward corruptions from one client machine to the cloud version or to other clients, and it will not propagate inconsistent file-system images from a client (such as those that occur after a crash).

We describe one of our general solutions in more detail. We propose that distributed systems be built with selective and lightweight versioning (SLEEVE), which can detect silent faults in select subsystems in a lightweight manner (with little space and performance overhead). For example, a developer can pick some important functionality (e.g., file-system namespace management) and protect it by developing a second, lightweight implementation of that functionality. This approach essentially transforms a target system into an efficient two-version form. Using the SLEEVE approach, we hardened three pieces of HDFS functionality: namespace management, replica management, and the read/write protocol.
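To make the memoryless-retry problem above concrete, the following minimal sketch contrasts a backup-task placement decision that forgets past attempts with one that remembers them. The class and method names (BackupTaskScheduler, pickNodeWithMemory, and so on) are invented for illustration and are not part of Hadoop's actual scheduler; the point is only that remembering which nodes a task has already been attempted on keeps the backup copy off the same problematic machine.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class BackupTaskScheduler {
        // Nodes this task has already been attempted on.
        private final Set<String> triedNodes = new HashSet<>();

        // "Memoryless" retry: simply take the first candidate, which may be
        // the very node that caused the task to run slowly in the first place.
        String pickNodeMemoryless(List<String> candidates) {
            return candidates.isEmpty() ? null : candidates.get(0);
        }

        // Retry with memory: skip nodes already tried for this task so the
        // backup copy lands on a different (hopefully healthy) machine.
        String pickNodeWithMemory(List<String> candidates) {
            for (String node : candidates) {
                if (triedNodes.add(node)) {  // add() returns false if already tried
                    return node;
                }
            }
            return null;  // every candidate has already been tried
        }
    }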
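The SLEEVE idea can be illustrated with a small, hypothetical sketch: a primary namespace implementation is shadowed by a much simpler second version, and the two are cross-checked on every lookup, turning silent corruption into a detectable, fail-stop event. The class and data structures below (ShadowedNamespace and its two maps) are invented for illustration and are not the HardFS or HDFS code; the real second version is far more carefully engineered, and the sketch only conveys the two-version, cross-checking structure.

    import java.util.HashMap;
    import java.util.Map;

    class ShadowedNamespace {
        // Primary (full, complex) mapping: file path -> block id.
        private final Map<String, Long> primary = new HashMap<>();
        // Lightweight second version: path hash -> block id (far less state,
        // hence little space and performance overhead).
        private final Map<Integer, Long> shadow = new HashMap<>();

        void create(String path, long blockId) {
            primary.put(path, blockId);
            shadow.put(path.hashCode(), blockId);
        }

        long lookup(String path) {
            Long fromPrimary = primary.get(path);
            Long fromShadow = shadow.get(path.hashCode());
            // Cross-check the two versions: a disagreement means one copy of
            // the namespace state has been silently corrupted.
            if (fromPrimary == null || !fromPrimary.equals(fromShadow)) {
                // Fail-stop instead of silently returning bad data, so standard
                // recovery (failover, reboot, fsck) can take over.
                throw new IllegalStateException("namespace mismatch for " + path);
            }
            return fromPrimary;
        }
    }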
Our experimental results show that while the original HDFS code silently misbehaves in many cases, HardFS isolates faulty behavior so that it remains within a single node. In particular, HardFS handles 90% of the fail-silent faults that result from random memory corruption and correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Since errors do not propagate to persistent storage or other nodes, previously fail-silent errors are transformed into fail-stop errors, enabling the use of standard recovery mechanisms such as failover, single-machine reboot, and/or execution of fsck. Furthermore, HardFS detection can often pinpoint corrupt data structures, enabling micro-recovery that repairs small portions of corrupted state. HardFS is able to micro-recover in seconds instead of rebooting over many hours.

Beyond these research contributions and intellectual merits, the DARE project has aided in human resource development and educational impact. In terms of human resources, the project has helped train several M.S. and Ph.D. students; after graduation, these students will work as developers or researchers in either the storage industry or Computer Science academia. In terms of broader educational impact, a service-learning course has been created at UW-Madison that trains university students to teach the basic concepts of Computational Thinking in weekly after-school clubs at local public schools. The university students gain experience in communicating and teaching, while the elementary-school children gain insight into the beauty and wonder of Computer Science, perhaps shaping their future career paths.

Project Start:
Project End:
Budget Start: 2010-09-15
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $190,000
Indirect Cost:
Name: University of Wisconsin Madison
Department:
Type:
DUNS #:
City: Madison
State: WI
Country: United States
Zip Code: 53715