The goal of this research project is to understand the tradeoffs between the MapReduce and parallel DBMS approaches to performing large-scale data analysis over large clusters of computers, and to bring together ideas from both communities. Both MapReduce and parallel database systems provide scalable data processing over hundreds to thousands of nodes, and both offer a stylized, high-level programming environment that lets users efficiently filter and combine datasets while masking much of the complexity of parallelizing computation over a cluster. They differ in substantial ways as well, however, including their approaches to fault tolerance, their data modeling requirements, their query flexibility, and their ability to function in a heterogeneous processing environment.
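As a rough illustration of the MapReduce programming style, the sketch below expresses a simple filter-and-count task as a Hadoop job: the map phase filters records and the reduce phase combines them. The pipe-delimited record layout (sourceIP, destURL, adRevenue), the revenue threshold, and all class names are hypothetical and chosen only for illustration; they are not taken from the project's benchmark.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SourceVisitCount {

      // Map phase: filter records, emitting (sourceIP, 1) for rows above a revenue threshold.
      public static class FilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          // Hypothetical layout: sourceIP|destURL|adRevenue (a real job would handle malformed lines).
          String[] fields = value.toString().split("\\|");
          if (fields.length == 3 && Double.parseDouble(fields[2]) > 10.0) {
            outKey.set(fields[0]);
            ctx.write(outKey, ONE);
          }
        }
      }

      // Reduce phase: combine the filtered records by summing counts per sourceIP.
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter-and-count");
        job.setJarByClass(SourceVisitCount.class);
        job.setMapperClass(FilterMapper.class);
        job.setCombinerClass(SumReducer.class);  // local pre-aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The framework handles partitioning the input, shuffling intermediate pairs, and restarting failed tasks; the user supplies only the map and reduce logic.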

This multi-university team of researchers is investigating how these differences affect the performance and scalability of the two approaches. The team is running experiments that compare an open-source MapReduce implementation (Hadoop) to two commercial parallel database systems (DB2 and Vertica) on a benchmark whose tasks are designed to expose the tradeoffs between the approaches. The goal is to determine which differences between the two approaches to large-scale data analysis reflect fundamental tradeoffs, and which can be reconciled within a single system, so that ideas from one community can benefit the other.
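For contrast, and again purely as a sketch rather than the project's benchmark code, the same hypothetical filter-and-count task can be expressed against a parallel DBMS as a single declarative SQL query submitted over JDBC; the connection string, table, and column names below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SourceVisitCountSql {
      public static void main(String[] args) throws Exception {
        // args[0] is a JDBC URL for the parallel DBMS (placeholder).
        try (Connection conn = DriverManager.getConnection(args[0]);
             Statement stmt = conn.createStatement();
             // The filter (WHERE) and combine (GROUP BY) are declared once; the DBMS
             // plans the distributed scan, shuffle, and aggregation itself.
             ResultSet rs = stmt.executeQuery(
                 "SELECT sourceIP, COUNT(*) AS visits " +
                 "FROM uservisits WHERE adRevenue > 10.0 " +
                 "GROUP BY sourceIP")) {
          while (rs.next()) {
            System.out.println(rs.getString("sourceIP") + "\t" + rs.getLong("visits"));
          }
        }
      }
    }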

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0843487
Program Officer: Vijayalakshmi Atluri
Project Start:
Project End:
Budget Start: 2009-02-01
Budget End: 2012-01-31
Support Year:
Fiscal Year: 2008
Total Cost: $109,506
Indirect Cost:
Name: University of Wisconsin Madison
Department:
Type:
DUNS #:
City: Madison
State: WI
Country: United States
Zip Code: 53715