It is an irony of our time that despite living in the 'information age' we are often data-limited. After decades of research, scientists still debate the causes and effects of climate change, and recent work has shown that a significant fraction of the most influential medical studies over the past 13 years have been subsequently found to be inaccurate, largely due to insufficient data. One reason for this apparent paradox is that modeling complex, real-world information sources requires rich probabilistic models that cannot be accurately learned even from very large data sets. On a deeper level, research inherently resides at the edge of the possible, and seeks to address questions that available data can only partially answer. It is therefore reasonable to expect that we will always be data-limited.

This research involves developing new algorithms and performance bounds for data-limited inference. Prior work of the PIs has shown that, by taking an information-theoretic approach, one can develop new algorithms that are tailored specifically to the data-limited regime and perform better than was previously known, and in some cases are provably optimal. This project advances the goal of developing a general theory for data-limited inference by considering a suite of problems spanning multiple application areas, such as classification; determining whether two data sets were generated by the same distribution or by different distributions; distribution estimation from event timings; entropy estimation; and communication over complex and unknown channels. Whereas these problems have all been studied before in isolation, prior work of the PIs has shown it is fruitful to view them as instances of the same underlying problem: data-limited inference.

Project Report

It is well accepted that we live in an age marked by the prevalence of data. Yet, paradoxically, many of the problems that society confronts arise not from having too much data but from having too little. Research in medicine, climate change, natural language processing, neuroscience, and many other areas is continually hindered by a lack of adequate data. A number of important controversies, from humankind's effect on the environment to the source of the rise of autism or gluten intolerance, could be easily settled by better data. This project considered how one can make better inferences from limited data, in a sense "squeezing" as much information out of existing data sets as possible. Our prior work showed that traditional ways of making inferences from data sets were unduly conservative. Using modern techniques, it is possible in some cases to make inferences with a high level of confidence in situations that would previously have been viewed as impossible. The research focused in particular on natural language data sets. Automatically inferring authorship or subject matter from the relative frequency of different words in a document is notoriously difficult because, even in large documents, a large number of words appear rarely. Each word, taken individually, does not provide a sufficiently large sample size to make inferences. This research showed, however, that taken together, a large number of rare words can be used to make authorship or subject matter decisions with a level of certainty that can be mathematically proven to be high. Using the same mathematical tools, we also proved that the existing ways in which certain wireless systems operate are optimal, so that it is fruitless to try to improve them. The project also led to the development of several course modules on modeling. Modeling has historically been an undertaught art in electrical engineering.
As part of one such module, we developed a text compression contest for a graduate information theory course. The students in the course are given a large sample of text, say from a novel, and their assignment is to build the shortest possible computer program that outputs the given text. This exercise is essentially one of modeling English text; the best models result in the most compression. The project also supported both a woman and an underrepresented minority student during their Ph.D. studies.
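The link between modeling and compression in the contest described above can be illustrated with a small sketch. It is not the course's contest code. It assumes a simple adaptive character model with add-one smoothing over a 256-symbol alphabet, and it reports the ideal code length, the number of bits an arithmetic coder driven by that model would need to within a few bits. The function name `adaptive_code_length` and the repeated sample sentence are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def adaptive_code_length(text, order, alphabet_size=256):
    """Ideal code length in bits for `text` under an adaptive character
    model that predicts each character from the preceding `order`
    characters, using add-one (Laplace) smoothing.

    Assumption: a 256-symbol alphabet known to encoder and decoder.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]
        # Probability assigned to the next character before seeing it.
        p = (counts[context][ch] + 1) / (totals[context] + alphabet_size)
        bits += -math.log2(p)
        # Update the model only after coding, so a decoder can stay in sync.
        counts[context][ch] += 1
        totals[context] += 1
    return bits


# Hypothetical sample: a highly repetitive stand-in for "a large sample of text".
sample = "the quick brown fox jumps over the lazy dog. " * 200

uniform_bits = len(sample) * math.log2(256)    # no model: 8 bits per character
order0_bits = adaptive_code_length(sample, 0)  # models character frequencies
order2_bits = adaptive_code_length(sample, 2)  # models two characters of context

print(f"no model : {uniform_bits:.0f} bits")
print(f"order 0  : {order0_bits:.0f} bits")
print(f"order 2  : {order2_bits:.0f} bits")
```

On this sample, the model that captures more of the text's structure assigns higher probability to what actually occurs and therefore yields a shorter code length. Real English text is far less repetitive, so the gains there would be smaller and depend on how well the model handles rarely seen contexts.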

Cornell University
United States