Many scientific and engineering endeavors call for estimating probabilities and distributions based on an observed data sample. When the sample size is large relative to the number of possible outcomes, for example when a biased coin is tossed many times, estimation is simple. However, in many applications the number of possible outcomes is large compared to the sample size. For example, in language modeling - used in compression, speech recognition, and data mining - the number of words and contexts is large compared to the amount of text at hand. Estimation in this large-alphabet regime is much more complex, and has been studied for over two centuries. While some good estimators have been derived, for example those named after Laplace, Krichevsky-Trofimov, and Good-Turing, very few optimality properties have been established for them, and each is known to perform poorly under some conditions.

Adopting an information-theoretic viewpoint, the investigators undertake a systematic study of these issues. They concentrate on two broad problems, concerning the estimation of: (1) the probability of each observed outcome and of the collection of outcomes not yet observed; (2) the underlying distribution, which does not associate probabilities with specific outcomes. For each problem they seek estimation algorithms that perform well in practice and have provable optimality properties such as small Kullback-Leibler divergence and other metrics between the underlying and estimated distributions. The problems they address are both theoretical, for example the data size required to estimate the underlying distribution to within a given confidence level, and computational, regarding the complexity and sequentiality of the derived algorithms.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Type
Standard Grant (Standard)
Application #
0514973
Program Officer
William H Tranter
Project Start
Project End
Budget Start
2005-07-01
Budget End
2009-06-30
Support Year
Fiscal Year
2005
Total Cost
$262,974
Indirect Cost
Name
University of California San Diego
Department
Type
DUNS #
City
La Jolla
State
CA
Country
United States
Zip Code
92093