Drineas, Petros Rensselaer Polytech Inst

Intellectual Merit. Massive data sets are ubiquitous and making effective use of such data sets is essential for the advancement of todays science, engineering, and business worlds. This proposal concerns the development and analysis of novel algorithms for structuring and interrogating multimode, heterogeneous data sets modelled by multimode arrays (tensors), which arise in many applications. The proposed algorithms use random sampling to extract representative data elements from the data sets and, subsequently, they operate on the sample. In order to yield improved results for problems with heterogeneities and non-uniformities, importance sampling and other sophisticated sampling techniques are employed. This proposal contains three specific research directions (the first three years) and an early career plan (the last two years). First, it is proposed to address the wellstudied and broadly applicable 1 and 2 regression problems from a sampling perspective in order to prove that a small, judiciously chosen sample contains the information necessary to approximately solve the problem. Second, the above results will be extended to derive simple, intuitive matrix decompositions, of the form CUR, where C contains a few columns of the matrix, R contains a few rows of the matrix, and U is a carefully chosen matrix. CUR-type decompositions provide summaries of the data that are very appealing from a data interpretation perspective, since they express all columns of A as linear combinations of a small number of basis columns (the columns in C), which come directly from the data, and similarly for the rows. Third, CUR-type decompositions will be extended to tensors and a basis for dealing with complex, multimode, heterogeneous, tensor data sets exhibiting multilinear structure will be developed. In the last two years, it is proposed to focus on alleviating the assumption of linear degrees of freedom across the various modes of tensor data sets, and seek sampling techniques that extract non-linear degrees of freedom. This proposal will extend in significant ways the previous work of the PI on fast Monte Carlo algorithms for performing computations on large matrices. Broader impact. This work will have broader impact in fields as diverse as the Social Sciences, Bioinformatics and Computer Science. The PI has initiated collaborations with three labs and will apply the proposed techniques on their data. The PI in collaboration with Prof. Alter and her lab at UT Austin will apply and interpret CUR-type decompositions on cell-cycle gene expression data in an effort to identify physiological pathways that are active during the different phases of the cell cycle. The PI will also collaborate with Prof. Kidds lab at the Genetics Department of Yale University to apply and interpret CUR-type decompositions on genotyping data from samples from 42 populations from around the world. Finally, the PI will collaborate with M. Maggioni at the Yale Math Department and faculty at the Yale Medical School to apply the proposed tensor decompositions on colon cancer datacubes. These three collaborations will serve as proofs of concept for the practical applicability of these techniques, and will lead the PI towards the broader goal of developing a toolbox for tensor analysis for use by the multitude of researchers in fields where multimode data naturally arise. Finally, the PIs work will be of interest to motivated undergraduates with a wide range of interests. It is proposed to disseminate this knowledge to undergraduates by continuing our previously proven commitment to involving undergraduates, even in their sophomore year, in cutting edge research. Extra effort will be made to involve underrepresented groups, in collaboration with the aforementioned labs.

Project Report

Massive data sets are ubiquitous and making effective use of such data sets is essential for the advancement of modern science, engineering, and business. This award developed novel numerical algorithms to analyze the structure of multimode datasets that are represented by massive matrices or tensors and arise in many applications. The fundamental innovation of the proposed algorithms is their randomized nature: the algorithms use random sampling and random projections to extract representative features from the datasets and they subsequently operate on the sample. In the context of the project, the PI and collaborators developed a novel importance sampling probability distribution (the so-called leverage score sampling) and also analyzed numerous random-projection-based sketching approaches to summarize massive matrices and tensors. In more detail, first of all, novel algorithms for regression problems (with respect to all p-norms between one and infinity) were developed; it was proven that a small, judiciously chosen sample contains the information necessary to approximately solve the problem. Second, two new matrix decompositions were proposed and analyzed: the so-called CUR matrix decomposition, where C contains a few columns of the matrix, R contains a few rows of the matrix, and U is a carefully chosen matrix, and the CX matrix decomposition, with C as described above and X a carefully constructed matrix. Both decompositions are very appealing from a data interpretation perspective, since they express all columns of A as linear combinations of a small number of basis columns (the columns in C), which come directly from the data (and similarly for the rows). Third, the aforementioned decompositions were extended to a tensor-based framework with linear and other degrees of freedom. The results of this award had considerable broader impact in modern, massive datasets that emerge from population genetics applications. Our early work on ancestry informative markers leveraged the developed theory in order to look for specific DNA markers (known as single nucleotide polymorphisms, or SNPs) that can predict the ancestry of an individual. The resulting papers in journals such as Genome Research, PloS Genetics, the Proceedings of the National Academy of Sciences, etc. attracted considerable follow-up work by both practitioners and computer scientists.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Application #
0545538
Program Officer
Anita J. LaSalle
Project Start
Project End
Budget Start
2006-02-01
Budget End
2013-12-31
Support Year
Fiscal Year
2005
Total Cost
$442,844
Indirect Cost
Name
Rensselaer Polytechnic Institute
Department
Type
DUNS #
City
Troy
State
NY
Country
United States
Zip Code
12180