The rapidly increasing size of the data sets treated in various applications is starting to render many of the traditional methods used in data exploration inadequate. These methods break down not only because of the increase in the number of observations (the size of the data sets themselves) but also because the underlying phenomena observed are intrinsically of high dimension, i.e., they involve a large number of variables or parameters. High-dimensional data sets present great mathematical challenges, but in practice the related difficulties are mitigated by the fact that, in most cases, not all of the measured variables are important for an understanding of the underlying phenomenon. One of the difficulties faced by current dimension reduction techniques is that existing algorithms are often too costly when dealing with very large data sets. To tackle a few of these challenges, the research team of this project will focus on the development of computationally efficient methods that blend classical techniques, such as principal component analysis (PCA) or locally linear embedding (LLE), with other strategies from numerical linear algebra and approximation theory to reduce problem sizes. For example, multilevel and divide-and-conquer techniques are quite common in other areas of scientific computing but have received relatively little attention in data mining; the proposed work will put methods of this type at the forefront. The research team will also consider tools borrowed from graph theory, specifically techniques based on hypergraphs, graph partitioning, and k-nearest-neighbor (kNN) graph construction, to help with dimension reduction. Finally, this project will address the complex issue of dimension reduction by means of tensors and the use of multilinear algebra.
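To illustrate the graph-theoretic ingredient mentioned above, the sketch below constructs a kNN graph from a data matrix. It is a minimal example only, assuming a brute-force pairwise-distance computation with NumPy/SciPy; the function name knn_graph and the parameter choices are illustrative and are not part of the methods proposed here.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix

def knn_graph(X, k=10):
    """Build a sparse k-nearest-neighbor graph from the rows of X.

    X : (n_samples, n_features) data matrix.
    Returns a sparse adjacency matrix whose nonzeros hold the distances
    from each point to its k nearest neighbors (the point itself excluded).
    """
    n = X.shape[0]
    D = cdist(X, X)                          # dense pairwise distances (fine for small n)
    nbr_idx = np.argsort(D, axis=1)[:, 1:k + 1]  # k nearest neighbors, skipping self
    rows = np.repeat(np.arange(n), k)
    cols = nbr_idx.ravel()
    vals = D[rows, cols]
    return csr_matrix((vals, (rows, cols)), shape=(n, n))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))       # 200 samples in 50 dimensions
    G = knn_graph(X, k=10)
    print(G.shape, G.nnz)                    # (200, 200) 2000
```

Such a graph is a common starting point for nonlinear dimension reduction methods (LLE among them); scalable versions avoid the dense distance matrix used in this sketch.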
Society is currently facing an unparalleled surge of exploitable information in scientific, engineering, and economic applications. Typical examples include face recognition, which has uses in security and commerce, and the processing of queries on the world-wide web. The data sets generated in these applications are growing not only in size (more data samples) but also in dimension (the number of parameters or variables used to represent each data sample). For example, in face recognition, where one deals with sets of pictures, the size would be the number of pictures and the dimension the number of pixels used to represent each picture. Reducing the dimension of the data is a vital tool in applications dealing with large data sets, and it is therefore not surprising that this line of research has gained enormous importance in the last few years. The investigators of this project will explore methods to solve this problem, putting an emphasis on methods characterized by low computational cost. Among these is a class of divide-and-conquer techniques, which split the data sets into smaller subsets to which the classical methods are applied independently.
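A minimal sketch of this divide-and-conquer idea follows, assuming PCA (computed via the SVD) as the classical method applied independently on each subset and a simple SVD-based merge of the local subspaces; the function name block_pca and the merging step are illustrative choices, not the algorithms to be developed in this project.

```python
import numpy as np

def block_pca(X, n_components=2, n_blocks=4):
    """Illustrative divide-and-conquer PCA.

    The rows of X are split into n_blocks subsets; an SVD-based PCA is
    performed independently on each subset, and the resulting local
    subspace bases are merged by one final small SVD.  A sketch of the
    general idea only, not the specific algorithm proposed here.
    """
    X = X - X.mean(axis=0)                       # center the data once, globally
    bases = []
    for block in np.array_split(X, n_blocks, axis=0):
        # classical PCA on the small block via its SVD
        _, _, Vt = np.linalg.svd(block, full_matrices=False)
        bases.append(Vt[:n_components])          # top right singular vectors
    # merge step: one more SVD on the stacked local bases
    _, _, Vt = np.linalg.svd(np.vstack(bases), full_matrices=False)
    V = Vt[:n_components].T                      # approximate principal directions
    return X @ V                                 # reduced-dimension coordinates

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((1000, 100))
    Y = block_pca(X, n_components=2, n_blocks=8)
    print(Y.shape)                               # (1000, 2)
```

The appeal of such schemes is that each block is small enough for the classical method to remain cheap, and the final merge operates on a matrix whose size is independent of the number of samples.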