Science and engineering analysis today increasingly proceed by systematic exploration of high-resolution captured or simulated data. With a more expansive sample of raw data, more detailed models can be built and more precise questions about the meaning of the data can be posed. However, the massively parallel systems required to process such massive data sets render traditional programming, storage, and fault-tolerance strategies ineffective.

Table-based, column-oriented distributed data storage systems are being developed to support such large-scale data analysis, led by Google's BigTable and including open-source variants such as Apache HBase. These new systems retain the flavor of database row-and-column organization, but offer simpler semantics, weaker isolation, and non-SQL interfaces. Their effectiveness for applications other than internet search is not well understood.

This exploratory project is developing an evaluation framework and applying it to a set of these new table-based storage systems, with the goal of capturing an understanding of the state of the art: how the systems perform and scale, and how reliable and usable they are.

In addition to benchmarks focusing on key metrics, the project's evaluation framework includes real-world applications drawn from machine learning: approaches to understanding streams of events, such as internet blog publications, and approaches to understanding complex interrelationships, such as social-networking graphs. These applications help extract the requirements needed to enable emerging knowledge-discovery applications of this kind.

Project Report

Data analytics, data science, and big data are names for applications that discover, extract, and characterize patterns and relationships in large collections of raw data, typically computer or business transaction records, or records from physical sensors such as satellites or medical equipment. Many of the techniques of data analytics are drawn from underlying statistical tools that become unwieldy and ineffective when the amount of data involved is thousands to millions of times larger than those tools were designed to handle. Even the most sophisticated and expensive database and data-warehousing technology has proven ineffective at emerging big-data scales. Modern data analysis has responded to this limitation with scalable table-based distributed storage systems and parallelized statistical algorithms.

This project, an exploratory pilot intended to inform a larger collaboration with the US Department of Defense, set out to improve the state of the art in testing and benchmarking facilities for scalable table-based distributed storage systems and to use this experience to inform subsequent proposals and projects. The outcomes of this project were: 1) an open-source testing and benchmarking tool, YCSB++; 2) five PhD-level students and seven MS-level students trained in the characteristics, strengths, weaknesses, and evaluation techniques of scalable table-based distributed storage systems; and 3) a collaboration with a Department of Defense user community interested in using these systems in government applications.

Our open-source testing and benchmarking tool, YCSB++, is a set of extensions to the prior open-source cloud-storage testing tool, YCSB. Our extensions evaluate and deconstruct the behavior of advanced features of these scalable table-based distributed storage systems: the overhead of cell-level security mechanisms; offloading of filter processing from the client machine to the table server; deep batching of sequences of data insertions and the latency of exposure to the eventual (weak or temporarily incorrect) consistency semantics that batching induces; and data-ingest acceleration by pre-splitting tables and bulk loading of pre-formatted table data. The tool was evaluated and documented in a publication at the ACM Symposium on Cloud Computing (SOCC), October 2011. In that publication we evaluated an open-source table-based distributed storage system, Apache HBase, and a DoD-authored scalable table-based distributed storage system, Accumulo, now also an Apache open-source project.

Most of the students involved in this project are now working in startups or with leaders of the data-analytics infrastructure industry such as Google or Amazon. The collaboration with the Department of Defense was successfully initiated, and follow-on work, funded outside the NSF, is ongoing.
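
To make the ingest-acceleration features concrete, the following is a minimal sketch against the HBase client API of the project's era (pre-1.0): it pre-splits a table at creation so ingest load spreads across region servers, then batches Puts through the client-side write buffer. The read-after-write lag that such buffering creates is the kind of consistency exposure YCSB++ instruments. "usertable" is YCSB's default table name; the column family, split points, and buffer size are illustrative assumptions, not values taken from the project.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitIngest {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Pre-split the table at creation time so ingest load spreads
            // across region servers instead of hammering a single region.
            // Split points are hypothetical.
            byte[][] splitKeys = new byte[][] {
                Bytes.toBytes("user3"), Bytes.toBytes("user6"), Bytes.toBytes("user9")
            };
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor("usertable");
            desc.addFamily(new HColumnDescriptor("f"));
            admin.createTable(desc, splitKeys);
            admin.close();

            // Batch writes client-side: with auto-flush off, Puts accumulate
            // in the write buffer and reach the servers in bulk. Until the
            // buffer flushes, a reader may not yet see these rows -- the
            // read-after-write lag that YCSB++-style tests measure.
            HTable table = new HTable(conf, "usertable");
            table.setAutoFlush(false);
            table.setWriteBufferSize(4 * 1024 * 1024); // 4 MB buffer (illustrative)
            for (int i = 0; i < 100000; i++) {
                Put put = new Put(Bytes.toBytes(String.format("user%09d", i)));
                put.add(Bytes.toBytes("f"), Bytes.toBytes("field0"),
                        Bytes.toBytes("value" + i));
                table.put(put);
            }
            table.flushCommits(); // push any Puts still sitting in the buffer
            table.close();
        }
    }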

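The filter-offload feature can be sketched similarly with HBase's standard Filter classes: the predicate attached to the Scan below ships to the region servers, so non-matching rows are discarded server-side instead of being streamed to the client and filtered there. The column names and comparison value are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ServerSideFilter {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "usertable");

            // The filter is serialized with the Scan and evaluated at the
            // region servers; only rows whose f:field0 equals "value42"
            // cross the network back to the client.
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("f"), Bytes.toBytes("field0"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("value42")));

            ResultScanner scanner = table.getScanner(scan);
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()));
            }
            scanner.close();
            table.close();
        }
    }
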
Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1019104
Program Officer: Almadena Chtchelkanova
Project Start:
Project End:
Budget Start: 2010-08-01
Budget End: 2012-07-31
Support Year:
Fiscal Year: 2010
Total Cost: $300,000
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213