The analysis of large high-dimensional data sets and graphs is motivated by many important applications, such as the study of databases of images and documents and the modeling of complex dynamical systems (e.g., transaction data, weather patterns, molecular dynamics). This research develops novel mathematical techniques for extracting and visualizing information from large data sets. Data layout, visualization, and human interaction are centered around multi-scale representations, which make it possible to access the data, the information derived from it, and the associated inference processes at multiple levels of resolution. Human interaction affects both the geometry of the data and the inference processes run on it, depending on the task at hand. The successful development of these techniques will have substantial impact on any application whose data lend themselves to a graph representation, such as citation networks, social networks, correlations in transaction data, and many aspects of biological systems, including gene expression and metabolic pathways. It will also reveal new multiscale geometric structures of high-dimensional data sets and graphs, and lead to a better understanding of how to extract information from them. The project develops novel multiscale embedding techniques and algorithms for graphs and data sets, based on diffusion processes on graphs. Such processes are used to generate embeddings of a graph at different scales, as well as to perform learning tasks, with and without human interaction. These multiscale embeddings come with strong quantitative guarantees on metric distortion. At the same time, multiscale bases are constructed with provable capabilities for sparsely representing functions on the graph, making them well suited to both visualization and learning. We demonstrate these techniques on data sets ranging from gene networks to document corpora.
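The idea of a diffusion-based embedding at multiple scales can be illustrated with a minimal sketch (this is a standard diffusion-maps construction for illustration, not the project's actual implementation; the kernel width `eps` and scale parameter `t` are hypothetical choices): a Markov matrix is built from pairwise affinities, and powers of it, encoded here through the eigenvalues raised to the power t, yield embeddings at progressively coarser scales.

```python
import numpy as np

def diffusion_embedding(X, eps=1.0, t=1, k=2):
    """Embed rows of X into k dimensions using a diffusion process.

    Larger t corresponds to a coarser scale: high-frequency
    eigendirections are damped by their eigenvalues raised to t.
    """
    # Pairwise squared distances and Gaussian affinities
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / eps)
    # Row-normalize to a Markov (diffusion) matrix
    P = W / W.sum(axis=1, keepdims=True)
    # Eigendecomposition; P is similar to a symmetric matrix,
    # so eigenvalues are real (up to numerical noise)
    vals, vecs = np.linalg.eig(P)
    idx = np.argsort(-vals.real)
    vals, vecs = vals.real[idx], vecs.real[:, idx]
    # Skip the trivial constant eigenvector (eigenvalue 1) and
    # weight each coordinate by its eigenvalue to the power t
    return vecs[:, 1:k + 1] * vals[1:k + 1] ** t
```

Running the same data through increasing values of `t` shrinks the embedding along low-eigenvalue directions, which is one concrete sense in which the representation becomes coarser as the scale increases.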

Project Report

Large amounts of data are being generated and collected in a wide variety of scientific and commercial settings, including scientific simulations of large complex systems (e.g., weather, biomolecules) and customer databases collected during internet transactions. These data sets are typically very large (millions or billions of data points), and each data point is itself "high-dimensional", i.e., a large collection of numbers (e.g., the weather pattern at a given time consists of a large ensemble of numbers, including pressure, temperature, and other characteristics, at many locations). To extract information from these data sets and perform predictions (e.g., in weather forecasting, or when recommending a useful product to a customer), it is necessary to analyze them both with mathematical algorithms and through visualization, in a way that conveys information to us and enables us to interact with the data to uncover patterns. In this project we introduced several novel techniques for analyzing and visualizing complex data sets. For example, we show how to produce parsimonious representations of very high-dimensional data sets by looking at an entire database at multiple levels of resolution: at a large scale the database may be highly compressed, at the cost of accuracy, into a low-dimensional representation, say a set of points in the plane. One may navigate and interact with the whole data set at this scale, and then zoom into a particular region of the data set, obtaining a higher-resolution representation of fewer data points (in the region of interest) but with greater accuracy. We implemented this navigation tool as a web application, usable even from mobile devices on a limited-bandwidth data connection, which enables efficient interaction with large data sets stored remotely.
At the same time, we developed a mathematical theory for these representations that describes, for example, the relationships between the size of the database, the accuracy achieved by the representations at different scales, and the number of bytes that need to be transmitted to view a certain portion of the data with a certain accuracy. In a similar vein, we constructed multiscale representations of graphs and networks, a type of data with its own structure (for example, a network of computers, or a real-world or online social network) determined by the relationships between the actors. Large networks can be extremely complex, and we developed multiscale representations that simplify them, in general at the expense of accuracy. In summary, we introduced novel ideas, algorithms, mathematical apparatus, and code to study large data sets and large graphs efficiently, by introducing the concept of multiscale analysis of such data types. We expect these representations to find further applications, in particular in modeling data and graphs and their changes in time, in machine learning tasks for predicting attributes of the data, and in more efficient and scalable algorithms for analyzing and interacting with large data sets.
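The notion of simplifying a network at the expense of accuracy can be sketched with a classical one-level coarsening step (heavy-edge matching, shown here as a generic illustration rather than the project's own construction): pairs of strongly connected nodes are merged into supernodes, and edge weights between supernodes are accumulated, yielding a smaller graph that approximates the original.

```python
import numpy as np

def coarsen(A):
    """One level of heavy-edge matching on a weighted adjacency
    matrix A: merge each unmatched node with its heaviest unmatched
    neighbor, then sum edge weights between the resulting supernodes."""
    n = A.shape[0]
    match = -np.ones(n, dtype=int)
    for i in np.argsort(-A.sum(axis=1)):  # visit heavier nodes first
        if match[i] >= 0:
            continue
        nbrs = [(A[i, j], j) for j in range(n)
                if j != i and A[i, j] > 0 and match[j] < 0]
        j = max(nbrs)[1] if nbrs else i  # heaviest free neighbor, or self
        match[i] = match[j] = min(i, j)
    # Relabel supernodes 0..m-1 and accumulate crossing edge weights
    labels = {v: c for c, v in enumerate(sorted(set(match)))}
    m = len(labels)
    Ac = np.zeros((m, m))
    for a in range(n):
        for b in range(n):
            if A[a, b] > 0:
                ca, cb = labels[match[a]], labels[match[b]]
                if ca != cb:
                    Ac[ca, cb] += A[a, b]
    return Ac
```

Applying this step repeatedly produces a hierarchy of ever-smaller graphs, the same kind of multiscale family described above: each level is cheaper to store and display, but edges merged inside a supernode are no longer resolved.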

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
0808847
Program Officer
Lawrence Rosenblum
Project Start
Project End
Budget Start
2008-05-15
Budget End
2014-07-31
Support Year
Fiscal Year
2008
Total Cost
$791,095
Indirect Cost
Name
Duke University
Department
Type
DUNS #
City
Durham
State
NC
Country
United States
Zip Code
27705