This project addresses the next major impediment to the continued adoption of "big-data" analytics---the management of their life cycle, which includes debugging, tuning, and auditing. Today, data-intensive analytics are improving operations across many industries, translating terabytes of raw data into actionable analyses. Taking advantage of big data will be necessary to sustain competitive advantages in areas ranging from power generation and retail to oil exploration, manufacturing, scientific research, and national security. However, the extreme scalability of these data processing architectures hides inefficiencies and obfuscates performance analysis, creating both obvious and hidden costs to their adoption. Tuning and debugging large data-intensive workflows is currently a black art that consists mostly of tedious manual analysis.
This research seeks to dramatically alter how data scientists design and debug their analytics, sidestepping this authoring and deployment bottleneck. In particular, the PIs are developing scalable, efficient architectures for capturing fine-grained data lineage---information that tracks the use of data through the analytic pipeline---from a range of data-intensive scalable computing (DISC) systems. Such lineage serves as a basis for discovering inefficiencies and suggesting optimizations via step-wise debugging, fault tracing, anomaly detection, and lineage-driven data cleaning and data mining. The development and open-source release of such lineage-capture and analysis platforms promises to dramatically accelerate the adoption of big-data analytics.
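To make the idea of fine-grained lineage concrete, the following is a minimal sketch of how per-record provenance can be threaded through a toy map/filter/reduce pipeline so that an output can be traced back to the raw inputs that produced it. The names (`Record`, `source`, `map_op`, and so on) are illustrative assumptions, not the API of any particular DISC system.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    value: int
    # Set of raw-input IDs this record derives from.
    lineage: frozenset = field(default_factory=frozenset)

def source(values):
    # Tag each raw input with a unique provenance ID.
    return [Record(v, frozenset({i})) for i, v in enumerate(values)]

def map_op(records, fn):
    # A transformed record inherits the lineage of its input.
    return [Record(fn(r.value), r.lineage) for r in records]

def filter_op(records, pred):
    return [r for r in records if pred(r.value)]

def reduce_op(records):
    # An aggregate's lineage is the union of its contributors' lineages.
    total = sum(r.value for r in records)
    lineage = frozenset().union(*(r.lineage for r in records)) if records else frozenset()
    return Record(total, lineage)

# Trace an output back to the inputs that contributed to it.
data = source([3, 8, 5, 12])
out = reduce_op(filter_op(map_op(data, lambda x: x * 2), lambda x: x > 10))
print(out.value, sorted(out.lineage))  # → 40 [1, 3]
```

Here the final sum traces back only to inputs 1 and 3 (the values 8 and 12), since the others were dropped by the filter; this backward traversal from an anomalous output to its contributing inputs is the primitive underlying the fault tracing and lineage-driven cleaning described above.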