Modern organizations such as social networking service providers, life-science research centers, and security agencies own unprecedented amounts of data, which they analyze with software applications written against that data. Writing, testing, and debugging such data-intensive applications is notoriously complex. This research develops novel techniques for managing that complexity.

The first objective of this research is to develop techniques that automatically find a representative subset of an existing large-scale data set, one that lets the programmer predict how the program will behave on the full data set. The intuition is that the cost of finding such a subset plus executing the program on it can be orders of magnitude lower than the cost of running the application on the full data set. The second objective is to develop techniques that automatically check whether a user program violates the correctness conditions imposed by data-processing systems that offer a MapReduce-style programming interface.
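The abstract does not specify how a representative subset would be chosen; one plausible baseline the first objective improves on is stratified random sampling, which keeps rare categories present in the subset. The sketch below is illustrative only, and all names (`representative_sample`, `key_fn`) are hypothetical, not the project's API.

```python
import random

def representative_sample(records, key_fn, fraction=0.01, seed=0):
    """Stratified random sampling: draw the same fraction from each
    stratum so rare categories survive into the subset.
    Illustrative sketch; not the technique developed by this project."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(key_fn(rec), []).append(rec)
    sample = []
    for group in strata.values():
        # Keep at least one record per stratum, however small.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

data = [{"kind": "a", "v": i} for i in range(1000)] + [{"kind": "b", "v": 0}]
subset = representative_sample(data, key_fn=lambda r: r["kind"])
# Both strata appear even though "b" has a single record.
assert {r["kind"] for r in subset} == {"a", "b"}
```

A naive uniform sample of 1% would likely miss the lone `"b"` record entirely, which is why per-stratum quotas matter when the subset must predict behavior on the full data set.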
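For the second objective, one well-known correctness condition in MapReduce-style systems is that a reduce function used as a combiner should be commutative and associative. A minimal property-testing sketch of such a check, assuming hypothetical names (`check_reduce_properties`, `reduce_fn`):

```python
import random

def check_reduce_properties(reduce_fn, samples, trials=100, seed=0):
    """Randomly probe a user-supplied reduce function for the
    commutativity and associativity that MapReduce-style systems
    typically assume. Illustrative sketch, not the project's checker."""
    rng = random.Random(seed)
    violations = []
    for _ in range(trials):
        a, b, c = (rng.choice(samples) for _ in range(3))
        # Commutativity: reduce(a, b) == reduce(b, a)
        if reduce_fn(a, b) != reduce_fn(b, a):
            violations.append(("commutativity", a, b))
        # Associativity: reduce(reduce(a, b), c) == reduce(a, reduce(b, c))
        if reduce_fn(reduce_fn(a, b), c) != reduce_fn(a, reduce_fn(b, c)):
            violations.append(("associativity", a, b, c))
    return violations

# A sum reducer satisfies both properties; subtraction violates them.
assert check_reduce_properties(lambda x, y: x + y, list(range(10))) == []
assert check_reduce_properties(lambda x, y: x - y, list(range(10))) != []
```

Random probing can only find violations, not prove their absence, which is why static checking of user programs (as this project proposes) is the harder and more valuable problem.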

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1117369
Program Officer: Sol J. Greenspan
Budget Start: 2011-09-01
Budget End: 2014-08-31
Fiscal Year: 2011
Total Cost: $497,955
Name: University of Texas at Arlington
City: Arlington
State: TX
Country: United States
Zip Code: 76019