Data matrices have structural properties that present challenges and opportunities for both the Numerical Linear Algebra (NLA) community and the Theory of Algorithms (ToA) community. Matrix factorizations, such as the eigendecomposition, the rank-revealing QR factorization, and the Singular Value Decomposition (SVD), have been widely used for information retrieval. Historically, matrix factorizations have been of central interest in NLA, since one can use them to express a problem in a form that can be solved more easily. ToA, on the other hand, has recently addressed the computation of such decompositions from a sampling perspective. The two approaches are complementary; however, thus far, the two communities have not worked closely together to integrate them, and more often than not each community is only cursorily aware of developments in the other. Defining significant research directions that both communities can work on, and applying the resulting linear algebraic algorithms to data analysis problems (among others), will lead to important breakthroughs. The main objective of this proposal is to bridge the existing gap and bring together NLA and ToA researchers to promote the cross-fertilization of ideas that could have immediate and long-term impact on data analysis. Towards that end, the PIs will work on a set of prototypical research problems that can significantly benefit from ideas and research in both NLA and ToA. These problems range from approximating the singular values and vectors of a matrix by element-wise sampling, to random-projection-based algorithms for least-squares problems, to the design of randomized algorithms for the widely used non-negative matrix factorization.
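To make the element-wise sampling idea concrete, the following is a minimal sketch (not the PIs' proposed algorithm) of one common scheme from the randomized NLA literature: keep a small number of entries of A, sampled with probability proportional to their squared magnitude and rescaled so the sparse sketch is an unbiased estimator of A; the top singular values of the sketch then approximate those of A. The function name and parameters are illustrative, and the exact sampling schemes and error guarantees differ across papers.

```python
import numpy as np

def elementwise_sample(A, s, seed=None):
    """Return a sparse sketch of A built from s entries sampled with
    probability proportional to A_ij^2, each rescaled so that the
    sketch is an unbiased estimator of A.  (Illustrative only; actual
    schemes in the literature vary.)"""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    p = (A.ravel() ** 2) / np.sum(A ** 2)      # sampling probabilities
    idx = rng.choice(m * n, size=s, p=p)       # sample with replacement
    S = np.zeros_like(A, dtype=float)
    i, j = np.divmod(idx, n)
    np.add.at(S, (i, j), A[i, j] / (s * p[idx]))  # rescale for unbiasedness
    return S

# The top singular values of the sparse sketch track those of A.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
S = elementwise_sample(A, s=20000, seed=1)
print(np.linalg.svd(A, compute_uv=False)[:3])
print(np.linalg.svd(S, compute_uv=False)[:3])
```

With more samples the sketch's spectrum concentrates around that of A; the appeal in practice is that far fewer than m·n entries often suffice to capture the dominant singular structure.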
This proposal seeks to explore the complementary perspectives that the Numerical Linear Algebra (NLA) and the Theory of Algorithms (ToA) communities bring to linear algebra and matrix computations. This is a timely quest, motivated by technological developments over the last two decades that permit the automatic generation of large datasets. Such datasets are often modeled as matrices. The proposed work will serve as a demonstration project on the fruitfulness of collaboration between the NLA and the ToA communities on problems of common interest. It is expected that the proposed research will demonstrate commonality in the two approaches, as well as highlight the advantages of the dual perspective. Through outreach activities, the PIs hope to motivate even more researchers to undertake similar investigations on related topics. The proposed algorithms will be numerically evaluated on a suite of matrices from application domains that the PIs have been working on over the past few years, in order to better understand their properties and to demonstrate their potential in dealing with modern, massive datasets. More specifically, the PIs will test the proposed strategies on population genetics data in order to infer the ancestry of individuals, as well as on gene expression data in order to investigate hypotheses that correlate genes and diseases. As such, we expect that the developed algorithms will impact the areas of linear algebra, randomized algorithms, information retrieval and data mining, as well as bioinformatics. Finally, in order to disseminate the proposed research, the PIs intend to organize workshops (following the example of the Workshops on Algorithms for Modern Massive Datasets in 2006, 2008, and 2010, of which the PIs were co-organizers) and working group meetings, and will disseminate their research via blogs and articles intended for broader audiences.
The research funding supported the development of novel randomized algorithms for common matrix problems that are used in a wide variety of data analysis and machine learning applications. This involved theoretical analysis (proving running-time and approximation-quality guarantees for the algorithms) and high-quality numerical implementation of the algorithms (on a single machine as well as in the parallel and distributed environments that are increasingly common in large-scale applications). The software produced includes code for least-squares regression, least absolute deviations, and quantile regression. The least-squares algorithm (LSRN) can handle millions of observations and hundreds or thousands of variables. The least absolute deviations algorithm comes in two variants: one that is faster in the random access model and designed to compute low-precision solutions, and one that is designed for distributed environments to compute medium-precision solutions. The intellectual merit of the work involves an improved understanding of randomization in matrix algorithms, an improved understanding of resource constraints in machine learning algorithms, and novel connections between traditional optimization methods, numerical analysis methods, theoretical computer science methods, and large-scale data analysis and machine learning methods. The broader impact of the work involves the ability to analyze high volumes of data in areas such as astronomy, biomedicine, brain imaging, and internet analysis.
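The random-projection idea behind randomized least-squares can be illustrated with a minimal "sketch-and-solve" example: compress the tall problem min_x ||Ax - b|| with a Gaussian random projection S and solve the much smaller problem min_x ||SAx - Sb||. Note that LSRN itself instead uses such a sketch to build a preconditioner for an iterative solver, which yields high-precision solutions; the sketch below is only a simplified illustration of the underlying idea, with illustrative function and parameter names.

```python
import numpy as np

def sketch_and_solve_lsq(A, b, sketch_rows, seed=None):
    """Approximate argmin_x ||Ax - b||_2 by solving the compressed
    problem min_x ||SAx - Sb||_2 with a Gaussian random projection S.
    (Simplified illustration; LSRN uses the sketch to build a
    preconditioner for an iterative solver instead.)"""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    S = rng.standard_normal((sketch_rows, m)) / np.sqrt(sketch_rows)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x

# A very overdetermined system: 100,000 observations, 20 variables.
rng = np.random.default_rng(0)
A = rng.standard_normal((100_000, 20))
x_true = rng.standard_normal(20)
b = A @ x_true + 0.01 * rng.standard_normal(100_000)
x_hat = sketch_and_solve_lsq(A, b, sketch_rows=400, seed=1)
print(np.linalg.norm(x_hat - x_true))
```

The compressed problem has only 400 rows instead of 100,000, yet its solution is close to the true coefficients; the sketch size needed for a given accuracy grows with the number of variables, not the number of observations.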