This project addresses the next major impediment to the continued adoption of "big-data" analytics: managing their life cycle, which includes debugging, tuning, and auditing. Today, data-intensive analytics are improving operations across multiple industries by translating terabytes of raw data into useful analysis. Taking advantage of big data will be necessary to sustain competitive advantage in areas ranging from power generation, retail, oil exploration, and manufacturing to various scientific disciplines and national security. However, the extreme scalability of these data-processing architectures hides inefficiencies and obscures performance analysis, creating both obvious and hidden costs to their adoption. Tuning and debugging large data-intensive workflows is currently a black art that consists mostly of tedious manual analysis.

The research seeks to fundamentally change how data scientists design and debug their analytics, sidestepping this authoring and deployment bottleneck. In particular, the PIs are developing scalable, efficient architectures for capturing fine-grain data lineage (information that tracks the use of data through the analytic pipeline) from a range of data-intensive scalable computing (DISC) systems. Such lineage serves as a basis for discovering inefficiencies and suggesting optimizations via step-wise debugging, fault tracing, anomaly detection, and lineage-driven data cleaning and data mining. The development and open-source release of such lineage-capture and analysis platforms promises to dramatically accelerate the adoption of big-data analytics.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1219220
Program Officer: Marilyn McClure
Budget Start: 2012-09-01
Budget End: 2018-03-31
Fiscal Year: 2012
Total Cost: $450,000
Name: University of California San Diego
City: La Jolla
State: CA
Country: United States
Zip Code: 92093