It is an irony of our time that, despite living in the 'information age', we are often data-limited. After decades of research, scientists still debate the causes and effects of climate change, and recent work has shown that a significant fraction of the most influential medical studies of the past 13 years have subsequently been found to be inaccurate, largely due to insufficient data. One reason for this apparent paradox is that modeling complex, real-world information sources requires rich probabilistic models that cannot be learned accurately even from very large data sets. On a deeper level, research inherently resides at the edge of the possible and seeks to address questions that available data can only partially answer. It is therefore reasonable to expect that we will always be data-limited.
This research involves developing new algorithms and performance bounds for data-limited inference. Prior work of the PIs has shown that, by taking an information-theoretic approach, one can develop new algorithms that are tailored specifically to the data-limited regime, perform better than previously known methods, and in some cases are provably optimal. This project advances the goal of developing a general theory of data-limited inference by considering a suite of problems spanning multiple application areas, including classification; determining whether two data sets were generated by the same distribution or by different distributions; distribution estimation from event timings; entropy estimation; and communication over complex and unknown channels. Although these problems have all been studied in isolation, prior work of the PIs has shown that it is fruitful to view them as instances of the same underlying problem: data-limited inference.
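As a toy illustration of the data-limited regime (a sketch for intuition, not drawn from the proposal itself), the following Python snippet compares the plug-in (empirical) entropy estimate against the true entropy of a uniform distribution when the number of samples is comparable to, or smaller than, the alphabet size. The variable names and the choice of a uniform source are illustrative assumptions; the systematic underestimate that appears for small sample sizes is exactly the kind of gap that motivates estimators tailored to limited data.

```python
import numpy as np

def plugin_entropy(samples):
    """Plug-in (empirical) entropy estimate in bits."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(0)
k = 1000                    # alphabet size (illustrative choice)
true_entropy = np.log2(k)   # entropy of the uniform distribution over k symbols

# With n comparable to k, many symbols are unseen and the plug-in
# estimate falls well below the true entropy; with n >> k it recovers.
for n in [100, 1000, 10000, 100000]:
    samples = rng.integers(0, k, size=n)
    est = plugin_entropy(samples)
    print(f"n = {n:6d}: plug-in estimate = {est:5.2f} bits  (true = {true_entropy:.2f} bits)")
```

Closing this gap without collecting more data is the sort of question a general theory of data-limited inference is meant to answer.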