Rapid advances in sensing technologies, as well as the ubiquity of online social and information networks, have led to unprecedented increase in the size and quantity of data sets that need to be stored and processed. Researchers are only beginning to grapple with the scale of data that modern systems generate. This provides a unique opportunity to design and implement algorithms at a scale that wasn't witnessed before, leading to unique challenges. The focus of this project is on designing algorithms for (arguably) the dominant computing paradigm for big data, MapReduce, and its variants. Despite its broad adoption across industries, there is very little work on developing theoretical models for these systems, or developing algorithms for natural optimization problems that arise in big-data applications.
This project seeks to develop a strong theoretical understanding of large-scale data processing and a principled approach to algorithm design in these settings. The key challenge with MapReduce-type platforms is that they incorporate and modify aspects from several computing paradigms, such as streaming and parallel computing, which leads to significant challenges in modeling such systems and in developing algorithm design principles that apply across platforms. The principal investigator (PI) first seeks to define complexity measures for MapReduce algorithms. The PI next aims to find connections to and differentiate from more classical and widely studied models such as the Parallel Random Access Machine (PRAM), streaming computation, and I/O-efficient computation models, seeking to discover common algorithm design principles and computational limitations in this process. Via this process, the PI will develop efficient algorithms for classes of graph theoretic and statistical optimization problems that arise in big-data settings, and will also develop a suite of practical implementations for these. The problems considered in this proposal are of fundamental nature, as evinced by their long and historic roots in theoretical computer science, parallel computation, and databases. Computational science will benefit from this research. Other broader impacts include graduate and undergraduate training in research and education.