Over the past decade, the precipitous drop in the cost of disk storage and the build-up of worldwide high-bandwidth fiber-optic communications have made massive amounts of data of different modalities (text, images, video) easily available to everyone over the Web. In science, engineering, business, and medicine, high-bandwidth sensors, large-scale simulations, and data collection bots generate immense data sets that need to be analyzed. Making sense of all this disparate data is becoming increasingly difficult. Unlike traditional databases, where data is carefully massaged to adhere to rigid schemata, much of this data is unstructured, is often dynamic rather than static, can contain large amounts of noise or even errors, and can be incomplete.

This project aims to develop general, rigorous, and efficient techniques for analyzing massive and distributed sets of unstructured data. The basic aim is to exploit ideas from computational topology and geometry in the study of the global structure of large, distributed data sets, and especially to develop data representations and transformations that make this structure more apparent. Topology studies the connectivity of spaces, so it is global by its very nature. It can determine certain connectivity invariants in a way that is unaffected by deformations of an object and does not require explicit parameterizations of the object's geometry. Its strength lies, in a sense, in its relative insensitivity to geometric properties, which permits it to discern underlying combinatorial information about how the geometric object is constructed and thereby to detect some qualitative properties. This type of global analysis can be quite important in understanding the overall structure of data sets. Geometry, though more local by nature, can also be used to study global structure by discovering how parts of an object relate to one another, or how parts of different objects can be similar. For example, the Erlangen program of Felix Klein has for over a century fueled mathematicians' interest in invariance under certain group actions as a key principle for understanding geometric spaces. Such invariances or symmetries can also be key to understanding and reasoning about data sets.
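As a small illustration of the connectivity invariants discussed above, the sketch below counts the connected components (the Betti number β0) and the independent cycles (β1) of a graph. It is an illustrative example only, not code from the project; the function name `count_components` and the sample edge list are hypothetical. Note that the input carries no coordinates at all: the result depends only on how vertices are joined, so it is unchanged by any deformation of an embedding.

```python
def count_components(num_vertices, edges):
    """Count connected components of an undirected graph with union-find."""
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent links to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = num_vertices
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Merging two distinct components reduces the count by one.
            parent[root_a] = root_b
            components -= 1
    return components


# Hypothetical sample data: two triangles that share no vertex.
num_vertices = 6
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]

betti_0 = count_components(num_vertices, edges)
# For a graph, the number of independent cycles is E - V + beta_0.
betti_1 = len(edges) - num_vertices + betti_0
```

For this sample, the graph has two components and two independent cycles, no matter how the six vertices are placed in space.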

The methods proposed here can be applied in many different settings where massive unstructured data sets arise. In science or engineering, large-scale distributed simulations can produce immense data sets; as an example, consider the Folding@Home project at Stanford, which generates protein folding trajectories using hundreds of thousands of CPUs throughout the world. In business, companies such as Google and Yahoo! have to mine billions of web clicks to develop algorithms for matching ads to web page content or to individual users. In medicine, 3D imaging is becoming commonplace. Medical imaging diagnostic systems, distributed throughout medical offices nationwide, should be able to efficiently share information about the shapes of organs and thereby collectively learn whether certain variations are associated with different diagnostic outcomes or treatment successes. In all these cases, understanding the global structure of the data can provide valuable scientific, engineering, or medical insights, enable better business decisions, or lead to more effective medical treatment planning.

Project Report

This project has developed geometric and topological techniques for understanding large data sets and their inter-relationships, with the aim of improving visual analytics techniques. Most visual data, such as images, videos, and 3D scans, naturally have a geometric character, but many other data sets, such as medical or biological micro-array data, can also be better understood by first embedding them into a geometric space via various feature sets. The project developed novel methods for extracting structure from such data sets and for estimating maps and correspondences between them that can be used to propagate and cross-check this structure. Some key accomplishments include:

The discovery of the Heat Kernel Signature (HKS) features for matching 3D shapes under isometric deformation. These made possible much improved algorithms for matching 3D models of humans or animals in various poses and, more generally, shapes whose deformations are approximately isometric.

The development of zig-zag persistence, a much generalized version of persistent homology for topological data analysis that has led to a flurry of new activity in computational topology and its applications to medical, astronomical, and geological data.

The introduction of functional maps between data sets and of networks connecting data based on such maps. Functional maps are both easier to compute and more flexible than traditional maps, permitting the use of linear algebraic tools in a setting where the problem was previously non-linear and even non-convex.

A new way to mathematically describe and visualize differences between data sets that can automatically illuminate the areas where distortions happen and pinpoint the nature of the distortions.

A key notion to emerge from the grant is the idea of building networks between related data sets and transporting information between them. This allows us to benefit from the "wisdom of the collection" when analyzing a specific data set (e.g., for image segmentation) or when trying to find maps and correspondences between different data sets. By creating societies of data sets and their associations in a globally consistent way, we enable a certain joint understanding of data that provides the powers of abstraction, analogy, compression, error correction, and summarization, all of which are keys to a deeper understanding of the data.
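To make the Heat Kernel Signature mentioned above more concrete, here is a minimal sketch of its standard spectral form, HKS(x, t) = sum_i exp(-lambda_i t) phi_i(x)^2, where lambda_i and phi_i are eigenvalues and orthonormal eigenfunctions of a Laplacian on the shape. This is not the project's code. As a simplifying assumption, the combinatorial Laplacian of a small cycle graph stands in for the Laplace-Beltrami operator of a 3D mesh, and the names `heat_kernel_signature` and `cycle_graph_laplacian` are hypothetical.

```python
import numpy as np


def heat_kernel_signature(laplacian, times):
    """Return an (n_vertices, n_times) array of HKS values."""
    # A symmetric Laplacian has real eigenvalues and orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    times = np.asarray(times, dtype=float)
    # decay[i, k] = exp(-lambda_i * t_k)
    decay = np.exp(-np.outer(eigvals, times))
    # HKS(x, t) = sum_i exp(-lambda_i * t) * phi_i(x)**2
    return (eigvecs ** 2) @ decay


def cycle_graph_laplacian(n):
    """Combinatorial Laplacian L = D - A of a cycle with n >= 3 vertices."""
    adjacency = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        adjacency[i, j] = 1.0
        adjacency[j, i] = 1.0
    return np.diag(adjacency.sum(axis=1)) - adjacency


# Assumed toy input: a cycle of 8 vertices sampled at three time scales.
hks = heat_kernel_signature(cycle_graph_laplacian(8), times=[0.0, 0.5, 2.0])
```

Because every vertex of a cycle is equivalent under its symmetries, all rows of `hks` agree, which reflects the intrinsic nature of the signature: it depends only on the Laplacian and not on how the shape is posed or how its vertices are labeled. On real meshes one would typically use a discretized Laplace-Beltrami operator and only the smallest eigenpairs.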

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0808515
Program Officer: Tie Luo
Budget Start: 2008-07-01
Budget End: 2014-08-31
Fiscal Year: 2008
Total Cost: $849,012
Name: Stanford University
City: Palo Alto
State: CA
Country: United States
Zip Code: 94304