Intellectual Merit: Data-intensive Scalable Computing (DISC) is increasingly important to peta-scale problems, from search engines and social networks to biological and scientific applications. Already datacenters built to support large-scale DISC computing operate at staggering scale, housing up to hundreds of thousands of compute nodes, exabytes of storage, and petabytes of memory. Current DISC systems have addressed these data sizes through scalability, however the resulting per-node performance has lagged behind per-server capacity by more than an order of magnitude. For example, in current systems as much as 94% of available disk I/O and 33% of CPU remain idle. This results in unsustainable cost and energy requirements. Meeting future data processing challenges will only be possible if DISC systems can be deployed in a sustainable, efficient manner.
This project focuses on two specific, unaddressed challenges to building and deploying sustainable DISC systems:
-a lack of per-node efficiency and cross-resource balance as the system scales, and -highly-efficient storage fault tolerance tailored to DISC workloads.
This project's approach is to automatically and dynamically ensure cross-resource balance between compute, memory, network, and underlying storage components statically during system design, as well as dynamically during runtime. The goal is to support general DISC processing in a balanced manner despite changing application behavior and heterogeneous computing and storage configurations. This work will result in a fully functional prototype DISC system supporting the Map/Reduce programming model to support general-purpose application programs.
Broader impacts include: -training diverse students, such as undergraduates and underrepresented groups - to understand DISC services as an interesting part of the overall curriculum and as a resource for interdisciplinary collaboration. -a public release of the proposed balanced runtime system, including support for higher-level programming models; -working with industrial partners as part of UCSD's Center for Networked Systems to address sustainability and efficiency issues in this critical portion of industrial and governmental data processing.
A key challenge in cloud and "Big Data" computing is ensuring that as data volumes grow, the resulting systems remain highly efficient and low cost. Despite the promising scaling behavior of data processing platforms like Hadoop MapReduce and Microsoft Dryad, their delivered per-node performance has lagged behind the innate per-node capability of servers, often by as much as an order of magnitude. A recent survey of several deployed platforms found that they deliver less than 10% of the underlying system capability. While they are scalable, that scale hides (and magnifies) significant inefficiency. This growing gap represents a significant amount of wasted resources, power, and cost. Our research in this area, supported by NSF grant CSR-1116079, has focused on optimizing a key bottleneck of Big Data systems--the storage and network components. In studying deployed platforms, we found that even simple jobs like "sort" involve many more IO operations than necessary, as a result of imbalance in the overall system design (here balance refers to the specific mixture of compute, memory, storage, and network bandwidth). We address this deficiency by focusing on designing efficient IO pipelines for data processing systems, and to that end designed and built two data processing systems in increasing complexity and utility. The first system is TritonSort, a case study in efficient sorting. Sorting forms the kernel of a large class of data processing applications, and its importance has been popularized by Jim Gray in an annual competition called Terasort (now GraySort). We designed TritonSort based on an observation from Agarwal and Vitter that any system that performs an out-of-core sort must perform at least two read and write operations, in the worst case. We say that any system that meets this lower bound has the the "2-IO" property. Existing scale-out platforms typically far exceed this lower bound, leading to inherent inefficiencies. We designed TritonSort to exactly meet the 2-IO lower bound by building it as a staged, pipeline-oriented dataflow system in which individual stages have a flexible interface to manage CPU, typed memory buffers and buffer pools, disk I/O, and network I/O. Rather than adopting a task-oriented design like Hadoop MapReduce, in which multiple fine-grained tasks operate over portions of the data, TritonSort is organized into two phases. During the first phase the entirety of the input data is read and distributed to the appropriate destination node. In the second phase, this unsorted intermediate data is sorted and stored. The second system we developed is Themis, an implementation of MapReduce that also achieves the 2-IO lower bound. Themis can support a wide variety of applications, including a DNA short read sequence application and a social network analysis tool. When compared to the (then) winner of the GraySort benchmark, Themis sorted 100 TB of data faster in absolute time, yet used 66 times fewer nodes. Impact: TritonSort and Themis have had research impact through publications in NSDI, SoCC, and the ACM TOCS journal. According to Google Scholar, publications from this project have been cited 78 times. TritonSort/Themis has attained five world records for performance and energy efficiency in 2010 and 2011, as recorded by http://sortbenchmark.org. It has won the 100TB "Indy" and "Daytona" GraySort records, the "Indy" MinuteSort record, and both 100TB "JouleSort" records. These latter two categories measure how much data can be sorted in 1 Joule of energy, which demonstrates the energy-efficiency that can be gained from our design and methodology. Members of this project have been invited to give numerous presentations on 2-IO computing, including at Google, USC, the University of Washington, and HP Labs.