Given samples from some unknown distribution, what can one infer about the underlying distribution, and how efficiently can these inferences be made?  In many of the most fundamental settings, our understanding of the computational and information theoretic possibilities and barriers is still startlingly poor.  This project tackles two broad research objectives: developing efficient algorithms for probing data, and understanding how to efficiently estimate properties of distributions.  The first line of research seeks to understand which questions about a dataset can be answered extremely efficiently, requiring computational resources (time, or memory) that are sublinear in the size of the dataset or distribution.  The second research objective is to understand the minimal amount of information necessary to ascertain, with high probability, whether or not a distribution or dataset possesses a given property.  In the context of statistical property estimation, this problem asks how few samples are needed to estimate the property in question to a desired accuracy, with high probability.  This research pursues both new estimation algorithms, and new information theoretic tools and lower bounds.

With vast and important datasets emerging across many disciplines, from genetic, biological, and medical databases, to databases documenting our economic and social behaviors, the challenge of how to make sense of them has particular immediate relevance and has rapidly become the bottleneck in scientific understanding.   The specific problems investigated in this project arise in the analysis of these datasets; algorithmic advances on these problems have the potential to very quickly be adopted and transform ongoing data analysis efforts.   Beyond the immediate implications for the data sciences, these questions are extremely basic and foundational. As such, new techniques, perspectives, and insights gleaned from their study are likely to have broad implications for other problems throughout computer science, statistics, information theory, and the data sciences.

Project Start
Project End
Budget Start
2014-07-01
Budget End
2019-06-30
Support Year
Fiscal Year
2013
Total Cost
$500,000
Indirect Cost
Name
Stanford University
Department
Type
DUNS #
City
Stanford
State
CA
Country
United States
Zip Code
94305