This project seeks to develop incremental processing abstractions and technologies to address the approaching bottleneck in processing unstructured web-scale data. Government, medical, financial, and web-based services increasingly depend on the ability to rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows (e.g., weekly web crawls or nightly telescope dumps). The need to process unstructured data has driven the development of massively parallel "ad-hoc" data processing systems, such as MapReduce. However, these systems process data in a snapshot fashion, forcing massive re-computation when even a small amount of new data arrives.

The core of the project consists of a cluster-based incremental data processing system that overcomes these limitations. A key component is a dataflow programming model that combines massive parallelism with flexible, incremental computation. An incremental processing controller orchestrates multiple backend data processing tasks, ensuring reliable, consistent operation in the event of node failures. The project seeks to shed light on the fundamental challenges and benefits of incremental processing for ad-hoc data by using both industrial and e-science applications. For example, through cooperation with Yahoo! Research, the project will vet existing prototypes on real-world web-indexing dataflows and large data sets. While in the short term the project provides a platform for highly skilled operators, the long-term goal is to significantly advance the methods and abstractions that the scientific community and commercial world use to process vast, dynamic data sets.
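
The contrast between snapshot and incremental processing can be illustrated with a minimal sketch. It is not the project's system or programming model; the word-count task and the names `snapshot_count` and `IncrementalCount` are hypothetical, chosen only to show how retained state lets a computation fold in a new generation of data without rescanning earlier ones.

```python
from collections import Counter
from typing import Iterable


def snapshot_count(all_batches: Iterable[list[str]]) -> Counter:
    """Snapshot style: rescan every document from every generation."""
    counts = Counter()
    for batch in all_batches:
        for doc in batch:
            counts.update(doc.split())
    return counts


class IncrementalCount:
    """Incremental style: keep state and fold in only the new batch."""

    def __init__(self) -> None:
        self.counts = Counter()

    def add_batch(self, batch: list[str]) -> Counter:
        # Only the newly arrived documents are scanned.
        for doc in batch:
            self.counts.update(doc.split())
        return self.counts


# Two hypothetical generations of data (e.g., successive weekly crawls).
crawl_week_1 = ["web data grows", "data arrives weekly"]
crawl_week_2 = ["new data arrives"]

inc = IncrementalCount()
inc.add_batch(crawl_week_1)
inc.add_batch(crawl_week_2)  # touches only week 2

full = snapshot_count([crawl_week_1, crawl_week_2])  # rescans both weeks
```

Both approaches arrive at the same counts; the incremental version does work proportional to the new batch rather than to the full history. A real system would also have to handle deletions, multi-step dataflows, and node failures, which this sketch ignores.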

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 0834784
Program Officer: Krishna Kant
Project Start:
Project End:
Budget Start: 2008-09-01
Budget End: 2009-08-31
Support Year:
Fiscal Year: 2008
Total Cost: $100,000
Indirect Cost:
Name: University of California San Diego
Department:
Type:
DUNS #:
City: La Jolla
State: CA
Country: United States
Zip Code: 92093