The scale of data produced in the world is growing faster than the ability to process it. One approach for dealing with this data deluge involves sublinear algorithms, which are designed to estimate some feature of data without needing to store, or sometimes even see, the entire data set. Sublinear algorithms include distribution testing (e.g., estimating if a lottery is fair or not), streaming algorithms (e.g., finding the most common URLs on the web), and property testing (e.g., estimating the maximum degree of a network). For these problems, computer scientists have carefully studied how the complexity of the solution grows with the problem parameters---for example, estimating if a lottery is fair requires a number of draws that scales with the square root of the number of possible numbers drawn. But results so far have not been able to analyze the solution complexity for concrete instances (e.g., for a birthday lottery with 366 possible numbers, how many samples are necessary to verify fairness?). This project aims to change that, by finding solutions with not only good asymptotic scaling, but good constant factors.
Developing sublinear algorithms with good constant factors will require new algorithmic techniques. The sublinear-algorithms literature is based on several widespread techniques like probability amplification that are simple, general, and optimal up to constant factors---and significantly suboptimal in their constant factors. By replacing these techniques with more fine-grained ones, this project aims to develop new algorithms with better performance in practice. This project also aims to empirically measure the worst-case performance of algorithms, by identifying which input distributions are provably hardest to solve. By testing different algorithms in practice, the project will discover and compare the actual impact of different algorithmic choices.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.