The PI proposes to investigate the numerical accuracy and robustness of randomized algorithms for matrix multiplication and overdetermined least squares problems. Existing analyses of randomized algorithms are mostly concerned with asymptotic time and space complexity in exact arithmetic, and very little is known about their numerical behavior in floating point arithmetic. The PI proposes to develop a numerical perturbation and stability theory for randomized algorithms for matrix multiplication and least squares problems. This entails the invention of new approaches and concepts to capture the numerical behavior of randomized algorithms. It is not at all clear what "numerical stability" means in this context, let alone how it should be defined. How does one distinguish variability caused by randomization from variability caused by finite precision? Where should parameters like failure probability, choice of sampling probabilities, and amount of sampling be accounted for? Proposed approaches for answering these questions will include matrix perturbation analysis, probability theory, and methods on matrix manifolds. Extensive numerical experiments will be performed to corroborate the analyses.

The motivation for randomized algorithms is the need to process massive streaming data sets that are too large for traditional deterministic algorithms. Randomized algorithms have been implemented successfully for applications such as pattern recognition, social network analysis, population genetics, circuit testing, and text classification. The proposed research will help to determine for which application domains a randomized algorithm is suitable, and it will also result in practical bounds and recommendations for parameter choices that achieve a user-specified accuracy. The proposed research is highly relevant because randomized algorithms will be indispensable for exascale computing, in applications like high energy physics and astronomy, where petabytes of data are expected to stream in daily and where tasks like rare event detection make a good understanding of numerical accuracy and robustness imperative.

Project Report

"Big data", the buzz word these days, means solving problems where the amount of incoming data is massive, often arriving in relentless uninterrupted streams. Examples include images beamed down from satellites circling the earth, and electromagnetic frequencies collected by telescopes scanning the night sky. A radical way to cope with too much stuff is to throw away most of it. Randomized algorithms -- in a very over-simplified nutshell -- do so by playing dice on what to throw away and what to keep. The question is under which circumstances radical dice playing can be relied upon to produce useful information. This question was investigated for tasks that, though basic, form fundamental building blocks for computational methods in statistics, science, engineering, and the data sciences. The tasks at hand are matrix computations, and in particular matrix multiplication, and least squares/regression. Intellectual Merit: 1. The PI assessed the accuracy of a Monte-Carlo algorithm for performing Gram matrix multiplication. Combining both, the traditional (deterministic) and a probabilistic perspective shows that the Monte-Carlo algorithm can produce useful if limited information -- but only if the data are highly redundant. (In technical terms: The structure of the right singular vector matrix determines whether the Monte Carlo algorithm has any chance at all to reproduce the exact product. The computed results in general can be accurate to 1-2 significant digits, provided the matrix has low rank and low stable rank.) 2. One way to calibrate the "dice playing" part of a randomized algorithm is to design a testbed, where the data can be easily tuned to be: highly redundant, extremely discriminating, or anything in between. The algorithm's reaction to the testbed is evaluated by analyzing the data pieces that it decides to keep, and how well they represent the pieces that were thrown away and the data set as a whole. This investigation comes in especially handy for the solution of large-scale least squares/regression problems. Employing a randomized algorithm to construct a pre-processor for the solution process can result in significant time savings. (In technical terms: For uniform sampling of rows from matrices with orthonormal columns, the PI derives probabilistic bounds on the condition number of the sampled matrices, in terms of coherence and leverage scores. To validate the bounds and develop the test bed, the PI designs several algorithms for generating orthonormal matrices with specified coherence and leverage scores. One application is the solution of large-scale least squares/regression problems. When a Krylov space solver is accelerated by a randomized preconditioner, the condition number bound is required for establishing the convergence rate.) 3. The research has provided a proof of principle for analyzing the accuracy and reliability of randomized algorithms in practical situations, for data sets of all sizes. (In technical terms: It is possible to derive explicit probabilistic error bounds that hold in a non-asymptotic context; and that are informative even when applied to matrices of small dimension, under stringent success probabilities.) Broader Impacts: 1. This research has produced analytical end empirical results to assess the accuracy and reliability of randomized matrix algorithms. 
The results establish the viability of randomized algorithms for practical applications with small and large data sets, and they assist users in deciding whether a randomized algorithm should be considered for their specific application.

2. The research has produced a testbed for the "dice playing" part of a randomized algorithm, in the form of a user-friendly software package: a Matlab Toolbox called kappa_SQ. Code and documentation for kappa_SQ are available on arXiv. kappa_SQ is designed with both experts and students in mind. A variety of parameter choices makes it easy for experts to experiment with randomized algorithms, perform empirical evaluations, and prototype novel algorithms by importing their own code. The default settings in kappa_SQ are designed for students, so they can immediately experience the algorithm's behavior in graphical form.
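First sketch. Intellectual Merit item 1 concerns Monte Carlo Gram matrix multiplication, which estimates A^T A from a small sample of rows. The code below is a minimal sketch in the spirit of the classic norm-proportional row sampling estimator; the function name, parameters, and test matrix are illustrative assumptions of this write-up, not the PI's implementation.

```python
import numpy as np

def monte_carlo_gram(A, c, rng=None):
    """Randomized estimate of the Gram matrix A.T @ A by row sampling.

    Samples c rows of A with probabilities proportional to their squared
    norms and rescales so that the estimator is unbiased in expectation.
    """
    rng = np.random.default_rng(rng)
    m, n = A.shape
    row_norms_sq = np.einsum('ij,ij->i', A, A)       # squared row norms
    probs = row_norms_sq / row_norms_sq.sum()
    idx = rng.choice(m, size=c, replace=True, p=probs)
    # Scale each sampled row by 1/sqrt(c * p_i), so E[S.T @ S] = A.T @ A.
    S = A[idx] / np.sqrt(c * probs[idx, None])
    return S.T @ S

# Example: accuracy is modest, and only when the data are highly
# redundant (low rank, low stable rank), matching the finding above.
rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 5)) @ rng.standard_normal((5, 50))  # rank 5
G_exact = A.T @ A
G_approx = monte_carlo_gram(A, c=500, rng=1)
print(np.linalg.norm(G_approx - G_exact) / np.linalg.norm(G_exact))
```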
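Second sketch. The bounds in item 2 are phrased in terms of leverage scores and coherence. Assuming the standard definitions (the leverage scores of A are the squared row norms of an orthonormal basis for its column space, and the coherence is the largest leverage score), this sketch shows how one might compute them and observe the condition number of a uniformly sampled row block; the function names are ours.

```python
import numpy as np

def leverage_scores(A):
    """Leverage scores of A (squared row norms of an orthonormal basis
    Q for range(A)) and the coherence (their maximum)."""
    Q, _ = np.linalg.qr(A, mode='reduced')
    scores = np.einsum('ij,ij->i', Q, Q)
    return scores, scores.max()

def sampled_condition_number(A, c, rng=None):
    """Condition number of c uniformly sampled rows of an orthonormal
    basis for range(A). When the coherence is low (leverage scores
    nearly uniform), this is close to 1 with high probability."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(A, mode='reduced')
    idx = rng.choice(Q.shape[0], size=c, replace=False)
    return np.linalg.cond(Q[idx])

# A Gaussian matrix has low coherence, so the sampled block stays
# well conditioned even for a modest sample size c.
A = np.random.default_rng(0).standard_normal((5000, 20))
scores, coherence = leverage_scores(A)
print(coherence, sampled_condition_number(A, c=200, rng=1))
```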
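Third sketch. Finally, a hedged illustration of the randomized preconditioner idea from item 2, in the style of Blendenpik-type least squares solvers: a QR factorization of a row sample of A supplies a triangular factor R, and a Krylov solver (here SciPy's LSQR, an assumption of this sketch) is applied to the preconditioned system. It shows why the condition number of the sampled matrix governs the convergence rate; it is not the PI's implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

def sampled_qr_preconditioner(A, c, rng=None):
    """R factor from a QR of c uniformly sampled rows of A. If the
    coherence of A is low, A @ inv(R) is well conditioned with high
    probability."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(A.shape[0], size=c, replace=False)
    _, R = np.linalg.qr(A[idx], mode='reduced')
    return R

def precond_lsq(A, b, c, rng=None):
    """Solve min ||A x - b|| via LSQR on the preconditioned operator
    A @ inv(R); the convergence rate is governed by cond(A @ inv(R))."""
    m, n = A.shape
    R = sampled_qr_preconditioner(A, c, rng)
    AR = LinearOperator((m, n),
                        matvec=lambda x: A @ np.linalg.solve(R, x),
                        rmatvec=lambda y: np.linalg.solve(R.T, A.T @ y))
    y = lsqr(AR, b)[0]
    return np.linalg.solve(R, y)   # undo the change of variables

rng = np.random.default_rng(0)
A = rng.standard_normal((20000, 50))
b = rng.standard_normal(20000)
x = precond_lsq(A, b, c=500, rng=1)
print(np.linalg.norm(A.T @ (A @ x - b)))  # normal-equations residual check
```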

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1145383
Program Officer: Balasubramanian Kalyanasundaram
Budget Start: 2011-09-01
Budget End: 2014-11-30
Fiscal Year: 2011
Total Cost: $85,000
Name: North Carolina State University Raleigh
City: Raleigh
State: NC
Country: United States
Zip Code: 27695