The amount of information that is becoming available every day is expanding at an exponential rate and this is starting to render ineffective standard techniques for data exploration. Though high-dimensional datasets present great mathematical challenges, in practice the related difficulties are mitigated by the fact that not all the measured variables are important for an understanding of the underlying phenomenon. The "dimensionality reduction techniques" considered in this project address this issue. Their principle is to combine the variables into a smaller set onto which the data is projected before attempting a solution of the original problem.

The problem of dimension reduction gives rise to many interesting mathematical and algorithmic challenges to linear algebra specialists. In particular, one of the difficulties faced by current techniques is that existing algorithms are often too costly when dealing with very large data sets. For example, a number of methods are based on a form of Principal Component Analysis (PCA) which becomes exceedingly expensive as the number of variables (features) and the number of samples increase. A similar calculation is also required for graph-based approaches such as the Locally Linear Embedding (LLE), or Laplacean eigenmaps. In addition, in many applications data sets are frequently updated, e.g., by adding or deleting data items, a situation for which standard matrix algorithms are not adapted. The proposed work will tackle a few of these challenges. It will focus on the development of computationally efficient dimension reduction methods and related techniques, by exploiting ideas from computational linear algebra. Multilevel or divide and conquer techniques are quite common in other areas of scientific computing but have received relatively little attention in data mining. The proposed work puts methods of this type at the forefront.

One of the broader impacts of the proposed work is that it will help promote interest in problems related to the current information revolution because its research theme blends mathematical methods, good algorithmic practices, and applications requiring effective numerical methods. The applications under consideration in this work are all of great relevance to many of the new challenges of society (social networks, commerce, and security). Finally, the software resulting from this research will be broadly disseminated to join an excellent pool of existing web sites that provide tools and repositories related to data exploration.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
1318597
Program Officer
Jack S. Snoeyink
Project Start
Project End
Budget Start
2013-09-01
Budget End
2017-08-31
Support Year
Fiscal Year
2013
Total Cost
$348,386
Indirect Cost
Name
University of Minnesota Twin Cities
Department
Type
DUNS #
City
Minneapolis
State
MN
Country
United States
Zip Code
55455