Many important problems in machine learning, scientific computing, and statistics benefit from fast procedures for solving numerical linear algebra problems. At the same time, many large-scale datasets, such as internet search logs, network traffic, and sensor network data, have been studied in data-stream literature. Surprisingly, a number of fast procedures used in numerical linear algebra have been made possible by exploiting tools that were originally developed in the data-stream literature. These developments are based on a technique called sketching, which is a tool for quickly compressing a problem to a smaller version of itself, for which one can then afford to run a much slower procedure on the smaller problem. A major goal of this project is to study foundational problems in the data-stream domain, and develop their connections to problems in numerical linear algebra. The algorithms developed here will be accessible to graduate and undergraduate students from computer science, machine learning, and mathematics, and the investigator plans to integrate the results of the project into a graduate course on algorithms for big data, as well as undergraduate algorithms courses. The investigator is actively working with underrepresented minority and undergraduate researchers on topics directly related to this project.

The first main thrust of this project is to develop new techniques for fundamental problems in data streams where the goal is to use minimal amount of memory while running algorithms over one pass on a data stream. Such problems include statistical problems - such as estimating the variance, moments, most frequent items, heavy hitters - for many of which optimal memory bounds are still unknown. Another challenge in this domain is that of processing massive streams for real-world graphs, such as geometric intersection graphs, which arise in cellular networks and scheduling theory. By studying a breadth of problems in different corners of data streams, the investigator plans to develop new techniques and build unexpected applications. The second main thrust of the project is to develop new algorithms and hardness results for fundamental problems in numerical linear algebra, connecting such problems to the study of data streams. CountSketch, which is a low-memory heavy hitters algorithm, paved the way for obtaining time-optimal algorithms for regression. This project will continue to push forward such connections, for example, by studying CountSketch in the context of tensors, which is a new application domain. The project will also investigate many deterministic (instead of randomized) data stream algorithms, and understanding their role in linear algebra. A major goal is to understand the limitations of speedups to linear algebra problems. One particular problem of interest here is to obtain low rank approximation with spectral norms.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start
Project End
Budget Start
2018-06-01
Budget End
2021-05-31
Support Year
Fiscal Year
2018
Total Cost
$433,978
Indirect Cost
Name
Carnegie-Mellon University
Department
Type
DUNS #
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213