Across many fields of science, engineering, and business, massive data sets are being generated at unprecedented rate by high-bandwidth sensors and cameras, large-scale simulations, or web-enabled large scale data collection. Much of this data has a geometric character, either directly or indirectly. For example, second generation LiDARs can map the earth's surface at 15-20 cm resolution; the Large Synoptic Telescope is set to produce about 30 terabytes of data each night; thirteen hours of video are uploaded to YouTube every minute; Facebook manages over 40 billion photos requiring more than one petabyte of data.

These data sets provide tremendous opportunities to enable novel capabilities that were unimaginable a few years ago. Capitalizing on these opportunities, however, and transforming these massive amounts of heterogeneous data into useful information for vastly different types of applications and users requires solving challenging algorithmic problems. An effective way of addressing this challenge is by designing efficient methods for producing informative yet succinct summaries of such geometric data sets. These summaries must work at multiple scales, and allow a wide variety of queries to be answered approximately but efficiently. The goal of this project is to study the theoretical underpinnings of compact representations and efficient algorithms for organizing, summarizing, cross-correlating, interlinking, and querying large distributed geometric data sets.

This project will design methods for computing summaries of many kinds of flavors, all with provable properties. Summaries can be combinatorial and metric (core sets and kernels), algebraic (linear sketches), topological (persistence diagrams), feature-based, and structural (encoding self-similarities in the data). The properties they aim to capture extend from low-level metric attributes, such as the diameter or width of a point set, to higher-level attributes revealing the internal structure of the data, as in the detection of symmetries and repeated patterns. This processing must be done in the presence of uncertainty in data coming from sensors, and optimize multiple performance measures, including communication cost for data distributed across multiple locations in a network. Another key aspect of this project is that it aims to understand not individual data sets in isolation but rather the inter-relationships and correspondences among different data sets, and to do so by communicating only summary information, without even having all the data in one place.

This work touches upon many topics in theoretical computer science and applied mathematics including low-distortion embeddings, compressive sensing, transportation metrics, spectral graph theory or harmonic analysis, machine learning, and computational topology.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Application #
1012254
Program Officer
Tracy J. Kimbrel
Project Start
Project End
Budget Start
2010-09-01
Budget End
2015-08-31
Support Year
Fiscal Year
2010
Total Cost
$448,539
Indirect Cost
Name
Duke University
Department
Type
DUNS #
City
Durham
State
NC
Country
United States
Zip Code
27705