A key task in information theory is to characterize fundamental performance limits in compression, communication, and more general operational problems involving the storage, transmission, and processing of information. Such characterizations are usually given in terms of information measures, among the most fundamental of which are the Shannon entropy and the mutual information. In addition to their prominent operational roles in the traditional realms of information theory, information measures have found numerous applications in statistical modeling and machine learning tasks. Various modern data-analytic applications deal with data sets naturally viewed as samples from a probability distribution over a large domain. Due to the typically large alphabet size and to resource constraints, the practitioner contends with undersampling in applications ranging from corpus linguistics to neuroscience. One of the main goals of this project is the development of a general theory, based on a new set of mathematical tools, that will facilitate the construction and analysis of optimal estimators of information measures on large alphabets. The other major facet of this project is the incorporation of the new theoretical methodologies into machine learning algorithms, thereby significantly impacting current real-world learning practices. Successful completion of this project will result in enabling technologies and practical schemes, in applications ranging from the analysis of neural response data to learning graphical models, that are provably much closer to attaining the fundamental performance limits than existing ones. The findings of this project will enrich existing curricula in big-data analytics. A new course dedicated to high-dimensional statistical inference, addressing estimation for large-alphabet data in depth, will be created and offered. Workshops on the themes and findings of this project will be organized and held at Stanford and UIUC.
A comprehensive approximation-theoretic approach to estimating functionals of distributions on large alphabets will be developed via computationally efficient procedures based on best polynomial approximation, with provable near-optimality guarantees. Our key observation, rooted in the high-dimensional statistics literature, is that while estimating the distribution itself requires the sample size to scale linearly with the alphabet size, functionals of the distribution, such as entropy or mutual information, can be estimated accurately with sublinear sample complexity. Doing so requires going beyond the conventional wisdom and developing more sophisticated approaches than maximum-likelihood ("plug-in") estimation. The other major facet of this project is translating the new theoretical methodologies into highly scalable and efficient machine learning algorithms, thereby impacting current real-world learning practices and substantially boosting performance in several of the most prevalent machine learning applications that rely on mutual information estimation, such as learning graphical models.
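To make the contrast with plug-in estimation concrete, the sketch below follows the general recipe described above: symbols observed many times are handled with a bias-corrected plug-in term, while symbols observed few times are handled by unbiasedly estimating a low-degree polynomial approximation of -p*log(p) near zero. This is a minimal illustration of the idea, not the project's estimator: the function names (entropy_plugin, entropy_poly), the constants c0, c1, c2, the use of Chebyshev interpolation as a stand-in for best uniform polynomial approximation, and the first-order bias correction for large counts are all assumptions made for this example and would require careful tuning and analysis in practice.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import polynomial as P


def entropy_plugin(counts):
    """Maximum-likelihood ('plug-in') entropy estimate, in nats."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p = counts[counts > 0] / n
    return float(-np.sum(p * np.log(p)))


def entropy_poly(counts, c0=1.6, c1=2.0, c2=1.0):
    """Illustrative polynomial-approximation entropy estimator (sketch only).

    `counts` should cover the full (known) alphabet, including zeros for
    unseen symbols.  Large-count symbols get a bias-corrected plug-in term;
    small-count symbols get an unbiased estimate of a degree-L polynomial
    approximation of -p*log(p) on the small-probability interval [0, Delta].
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    L = max(1, int(c0 * np.log(n)))      # polynomial degree, roughly log(n)
    Delta = c1 * np.log(n) / n           # "small probability" regime
    thresh = c2 * np.log(n)              # count threshold separating regimes

    # Chebyshev interpolation of -x*log(x) on [0, Delta], used here as a
    # proxy for the best uniform polynomial approximation.
    def phi_mapped(t):                   # t in [-1, 1] maps to x in [0, Delta]
        x = Delta * (t + 1.0) / 2.0
        return np.where(x > 0, -x * np.log(np.maximum(x, 1e-300)), 0.0)

    b = C.cheb2poly(C.chebinterpolate(phi_mapped, L))   # power series in t

    # Change of variables t = 2*x/Delta - 1 via Horner's rule on polynomial
    # coefficients, giving coefficients a[j] of x**j.
    lin = np.array([-1.0, 2.0 / Delta])
    a = np.zeros(1)
    for bm in b[::-1]:
        a = P.polyadd(P.polymul(a, lin), [bm])

    est = 0.0
    for N in counts:
        if N > thresh:
            p_hat = N / n
            est += -p_hat * np.log(p_hat) + 1.0 / (2.0 * n)  # first-order bias correction
        else:
            # Unbiased estimate of sum_j a[j] * p**j: for multinomial counts,
            # E[N*(N-1)*...*(N-j+1)] = n*(n-1)*...*(n-j+1) * p**j.
            term = a[0]
            ratio = 1.0
            for j in range(1, len(a)):
                ratio *= (N - j + 1) / (n - j + 1)
                if ratio == 0.0:
                    break
                term += a[j] * ratio
            est += term
    return float(est)


if __name__ == "__main__":
    # Toy comparison in the undersampled regime (sample size below alphabet size).
    rng = np.random.default_rng(0)
    k, n = 10_000, 2_000
    p = rng.dirichlet(np.ones(k))
    counts = rng.multinomial(n, p)
    true_H = float(-np.sum(p * np.log(p)))
    print("true:", true_H,
          "plug-in:", entropy_plugin(counts),
          "poly:", entropy_poly(counts))
```

The small-count branch relies on the identity E[N(N-1)...(N-j+1)] = n(n-1)...(n-j+1) p^j for multinomial counts, which yields unbiased estimates of the monomials p^j and hence of the polynomial approximation of -p*log(p); this is the mechanism that sidesteps the severe bias of the plug-in estimator in the undersampled regime.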