Collections of distributions arise naturally in the analysis of large data sets. Since storing more than a small fraction of such data is impractical, distributional representations are typically used to summarize it in compact form. For example, a document in a corpus is commonly represented by a normalized vector of keyword frequencies, an image by a histogram of gradient features, and a speech signal by a spectral density over a frequency domain.
Representing data sets as collections of distributions enables analysis via powerful concepts from statistics, learning theory, and information theory. Notions like strength of belief, information content, and pattern likelihood are used to extract meaning and structure from the data, and are quantified using information measures such as the Kullback-Leibler divergence and the broader class of Bregman divergences to which it belongs.
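For concreteness, the standard definitions run as follows: given a strictly convex, differentiable function \(\varphi\), the Bregman divergence between points \(x\) and \(y\) is

\[
D_\varphi(x, y) \;=\; \varphi(x) \;-\; \varphi(y) \;-\; \langle \nabla \varphi(y),\, x - y \rangle,
\]

and the choice \(\varphi(x) = \sum_i x_i \log x_i\) recovers, for normalized distributions, the Kullback-Leibler divergence \(D(x \,\|\, y) = \sum_i x_i \log (x_i / y_i)\).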
These measures capture meaning in data in a way that traditional metrics cannot, by tying abstract notions of information loss and transfer to concrete geometric quantities like distances. However, they lack properties such as symmetry and the triangle inequality that traditional geometric algorithms for data analysis rely on.
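The failure of both properties is easy to exhibit numerically. The sketch below (pure Python; the three distributions are illustrative, not drawn from any real data set) shows that the Kullback-Leibler divergence is asymmetric and can violate the triangle inequality:

```python
from math import log

def kl(p, q):
    """D(p || q) for discrete distributions with full support."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q))

p = [0.70, 0.20, 0.10]
q = [0.40, 0.40, 0.20]
r = [0.55, 0.30, 0.15]

print(kl(p, q))             # ~0.184
print(kl(q, p))             # ~0.192 -> kl(p, q) != kl(q, p): not symmetric
print(kl(p, r) + kl(r, q))  # ~0.093 -> less than kl(p, q): triangle inequality fails
```

Here the detour through r is "shorter" than the direct divergence from p to q, so algorithms that prune search spaces or bound errors via the triangle inequality cannot be applied as-is.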
In this project, the PI will develop a systematic, rigorous, and general algorithmic framework for manipulating these divergences. This framework will provide the foundation for efficient and accurate analysis of spaces of distributions, and will lead to deeper insights into analysis problems across a wide range of applications.