One dominant characteristic of today's large-scale computing systems is the prevalence of large storage clusters. Storage clusters at the scale of hundreds or thousands of commodity machines are increasingly being deployed. At companies such as Amazon, Google, and Yahoo, thousands of nodes are managed as a single system.
While large clusters have brought many benefits, they also bring a new challenge: a growing number and frequency of failures that must be managed. Bits, sectors, disks, machines, racks, and many other components fail. With millions of servers and hundreds of data centers, there are millions of opportunities for these components to fail. Failing to deal with these failures directly impacts the reliability and availability of data and jobs.
Unfortunately, data-loss incidents continue to be reported. For example, in March 2009, Facebook lost millions of photos due to simultaneous disk failures that "should" rarely happen at the same time (but did); in July 2009, a large bank was fined a record total of 3 million pounds after losing data on thousands of its customers; and in October 2009, T-Mobile Sidekick, which uses Microsoft's cloud service, also lost its customer data. These incidents show that existing large-scale storage systems are still fragile in the face of failures.
To address the challenges of large-scale recovery, this project aims to: (1) identify the fundamental problems of recovery in today's scalable world of computing, (2) improve the reliability, performance, and scalability of existing large-scale recovery, and (3) explore formally grounded languages that enable rigorous specification of recovery properties and behaviors. Our vision is to build systems that "DARE to fail": systems that deliberately fail themselves, exercise recovery routinely, and enable easy and correct deployment of new recovery policies.
For more information, please visit this website: http://boom.cs.berkeley.edu/dare/
The DARE project advances cloud recovery testing techniques and thus improves the dependability of cloud systems. The project has produced several outcomes. With FATE (Failure Testing Service) and DESTINI (Declarative Testing Specifications), recovery is systematically tested in the face of multiple failures, and correct recovery is specified clearly, concisely, and precisely. FATE and DESTINI can explore over 40,000 failure scenarios, including multiple failures in a single system, and can easily express tens of recovery specifications. With PreFail (a programmable failure-injection tool), which advances FATE, testers can write a wide range of policies to prune the large space of multiple failures and spend 10X–200X less time than exhaustive testing. With HARDFS, cloud storage such as HDFS can recover from fail-silent (non-fail-stop) behaviors that result from memory corruption and software bugs. HARDFS employs a new approach, selective and lightweight versioning (SLEEVE), and recovers orders of magnitude faster than a full reboot by using micro-recovery. With SAMC (Semantic-Aware Model Checking), the scalability of distributed system model checkers is significantly improved. Using simple semantic information about the target cloud system, SAMC can reduce redundant reorderings of messages, crashes, and reboots during the state exploration process. SAMC can find deep distributed system bugs one to two orders of magnitude faster than state-of-the-art distributed system model checkers. Finally, with CBS (Cloud Bug Study), the largest bug study for cloud systems to date, new and unique problems such as scalability bugs, data consistency bugs, and many others specific to cloud systems are analyzed and made available to the cloud dependability research community.
Broader Impact: The DARE project places significant value on technology transfer; the outcomes of the project have led to direct industrial impact. For example, approaches from the DARE cloud testing frameworks have been adopted by several industries that deploy cloud clusters. In addition, predictability is key to the success of multi-billion dollar computing, and DARE results address important concerns raised in building sustainable computing, such as failure detection, isolation, diagnosis, and prediction. Finally, users from many areas (science, healthcare, business, education, military, and government) increasingly use large-scale storage and computing services, and the outcomes of the DARE project improve the reliability and availability of these services.
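As a rough illustration of the policy-driven pruning described for PreFail above, the Python sketch below enumerates every pair of failure points and then applies a tester-written policy to discard combinations judged redundant. This is not PreFail's actual interface: the failure points, the function names, and the policy itself are hypothetical, chosen only to show how a small policy can shrink the space of multiple-failure scenarios.

```python
from itertools import combinations

# Hypothetical failure points: (node, I/O operation) pairs where a
# failure could be injected. These are illustrative, not from PreFail.
FAILURE_POINTS = [
    ("node1", "write"), ("node1", "read"),
    ("node2", "write"), ("node2", "read"),
    ("node3", "write"),
]

def exhaustive(points, k):
    """Return every combination of k distinct failure points."""
    return list(combinations(points, k))

def distinct_nodes_policy(combo):
    """Hypothetical pruning policy: keep only combinations in which
    each node fails at most once."""
    nodes = [node for node, _ in combo]
    return len(nodes) == len(set(nodes))

def prune(combos, policy):
    """Keep only the failure combinations accepted by the policy."""
    return [c for c in combos if policy(c)]

all_pairs = exhaustive(FAILURE_POINTS, 2)
kept = prune(all_pairs, distinct_nodes_policy)
print(f"exhaustive: {len(all_pairs)} scenarios, after pruning: {len(kept)}")
```

With only five failure points the reduction is small; the savings from this kind of policy grow with the number of failure points and with the number of failures injected per run.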