Collections of distributions arise naturally when analyzing large data sets. Since it is impractical to store more than a small fraction of such data, distributional representations are typically used to summarize the data in compact form. For example, a document in a corpus is typically represented by a normalized vector of keyword frequencies, an image by a histogram over gradient features, and a speech signal by a spectral density over a frequency domain.

Representing data sets as collections of distributions enables analysis via powerful concepts from statistics, learning theory, and information theory. Concepts like strength of belief, information content, and pattern likelihood are used to extract meaning and structure from the data, and are quantified using information measures like the Kullback-Leibler divergence and its parent class, the Bregman divergences.
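To make the relationship concrete, the following is a minimal sketch (not part of the award abstract) of the general Bregman divergence, which for a convex generator F(x) is defined as D_F(x, y) = F(x) - F(y) - ⟨∇F(y), x - y⟩. Choosing the negative entropy as F recovers the Kullback-Leibler divergence on probability vectors, and the squared norm recovers the squared Euclidean distance; the function names below are illustrative, not from the source.

```python
import math

def bregman(F, grad_F, x, y):
    """Bregman divergence D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""
    return F(x) - F(y) - sum(g * (xi - yi) for g, xi, yi in zip(grad_F(y), x, y))

# Generator 1: negative entropy, which yields the KL divergence
# when x and y are probability vectors (each summing to 1).
neg_entropy = lambda x: sum(xi * math.log(xi) for xi in x)
grad_neg_entropy = lambda x: [math.log(xi) + 1 for xi in x]

# Generator 2: squared Euclidean norm, which yields ||x - y||^2.
sq_norm = lambda x: sum(xi * xi for xi in x)
grad_sq_norm = lambda x: [2 * xi for xi in x]

p, q = [0.1, 0.4, 0.5], [0.8, 0.1, 0.1]
kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(bregman(neg_entropy, grad_neg_entropy, p, q))  # matches KL(p || q)
print(bregman(sq_norm, grad_sq_norm, p, q))          # matches ||p - q||^2
```

The single `bregman` routine covers both cases, which is what makes the Bregman family a natural "parent class" for algorithm design: one analysis can apply to many divergences at once.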

These measures capture meaning in data in a manner that traditional metrics cannot, by connecting abstract notions of information loss and transfer with concrete geometric notions like distance. However, they lack properties such as symmetry and the triangle inequality that traditional geometric algorithms for data analysis require.
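The failure of symmetry is easy to observe numerically. The short sketch below (an illustration, not from the award abstract; the distributions are arbitrary) computes the Kullback-Leibler divergence in both directions and shows the two values disagree, so KL is not a metric:

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) for discrete probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.4, 0.5]
q = [0.8, 0.1, 0.1]

# Asymmetry: KL(p || q) and KL(q || p) differ in general,
# which rules out treating KL as an ordinary distance.
print(kl(p, q))
print(kl(q, p))
```

Because standard geometric algorithms (nearest-neighbor search, clustering, embeddings) assume a symmetric distance obeying the triangle inequality, this asymmetry is exactly the obstacle the abstract refers to.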

In this project, the PI will develop a systematic, rigorous and global algorithmic framework for manipulating these distances. This framework will provide the foundation for efficient and accurate data analysis of spaces of distributions, and will lead to deeper insights into analysis problems across a wide range of applications.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 0953066
Program Officer: Dmitry Maslov
Project Start:
Project End:
Budget Start: 2010-02-01
Budget End: 2015-01-31
Support Year:
Fiscal Year: 2009
Total Cost: $390,420
Indirect Cost:
Name: University of Utah
Department:
Type:
DUNS #:
City: Salt Lake City
State: UT
Country: United States
Zip Code: 84112