Modern organizations such as social networking service providers, life-science research centers, and security agencies own unprecedented amounts of data. Such organizations want to analyze their data through software applications written against that data. Writing, testing, and debugging such data-intensive software applications is notoriously complex. This research develops novel techniques for dealing with this complexity.
The first objective of this research is to develop techniques that can automatically find a representative subset of an existing large-scale data set that allows the programmer to predict how the program will behave on the full data set. The intuition is that the resources needed to find a representative data subset and execute the program on that subset can be orders of magnitude lower than those needed to run the application on the full data set. The second research objective is to develop techniques that automatically check whether a user program violates the correctness conditions imposed by data processing systems that offer a MapReduce-style programming interface.
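To make the second objective concrete: MapReduce-style systems commonly require that user-supplied reduce functions be insensitive to the order in which input values arrive, since the runtime may group and deliver values in any order. The sketch below is a minimal, hypothetical dynamic check of this order-insensitivity condition on sample inputs; it is an illustration of the kind of violation such checkers look for, not the technique developed by this research, and the function names are assumptions.

```python
import itertools

def is_order_insensitive(reduce_fn, sample_values):
    """Test, on one sample input list, whether reduce_fn returns the
    same result for every permutation of its inputs -- a correctness
    condition that MapReduce-style systems commonly impose on
    user-written reducers. A dynamic check like this can find
    violations but cannot prove their absence."""
    baseline = reduce_fn(list(sample_values))
    for perm in itertools.permutations(sample_values):
        if reduce_fn(list(perm)) != baseline:
            return False  # order of inputs changed the result
    return True

# A summing reducer is order-insensitive; a reducer that keeps the
# first value it sees is not, and would be flagged as a violation.
```

Exhaustive permutation testing only scales to tiny samples; a practical checker would instead combine random permutations with static analysis of the reducer's code.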