While modern datasets are very large, the amount of information per variable is often relatively small. This includes datasets from genomics, social networks, and many applications in machine learning and artificial intelligence. For instance, in genomics we often track hundreds of thousands of genes, but have only a few hundred independent samples in which to observe them. Similarly, online social networks are massive, yet the structure of friendships provides only a relatively small amount of data per individual. This kind of data is called "high-dimensional", and it poses new challenges for mathematics, statistics, and computer science, especially when, as with all real data, it is noisy or incomplete. This project will identify exactly when and how it is mathematically possible to find patterns in these massive but noisy datasets, giving scientists across many fields a useful guide to how much data they need to draw reliable conclusions, and developing new algorithms that solve modern data science problems efficiently and optimally.
Through the study of community detection, noisy graph isomorphism, and matrix/tensor factorization, this project will develop a general framework to (1) locate the information-theoretic limit below which observations are too noisy to detect the underlying pattern, or even to tell whether a pattern exists; (2) devise efficient algorithms that succeed all the way down to the lowest possible signal-to-noise ratio; and (3) prove that important classes of algorithms need super-polynomial time in certain hard regimes.
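As a concrete illustration of aims (1) and (2) (an illustrative sketch, not part of the award text): in the stochastic block model with two equal-size communities, where within-community edges appear with probability a/n and between-community edges with probability b/n, detecting the planted partition is information-theoretically possible exactly when (a - b)^2 > 2(a + b), the Kesten-Stigum threshold. The Python sketch below, with hypothetical parameter choices, samples such a graph above this threshold and recovers the partition with a simple spectral method.

    # Illustrative sketch: community detection in a two-community
    # stochastic block model (SBM), above the Kesten-Stigum threshold
    # (a - b)^2 = 2(a + b). Parameter choices are hypothetical.
    import numpy as np

    def sample_sbm(n, a, b, rng):
        """Sample a balanced two-community SBM; return (adjacency, labels)."""
        labels = np.repeat([1, -1], n // 2)            # planted partition
        same = np.equal.outer(labels, labels)          # same-community mask
        probs = np.where(same, a / n, b / n)           # edge probabilities
        upper = np.triu(rng.random((n, n)) < probs, 1) # sample upper triangle
        adj = (upper | upper.T).astype(float)          # symmetrize, zero diagonal
        return adj, labels

    def spectral_labels(adj):
        """Guess communities from the sign of the second-largest eigenvector."""
        vals, vecs = np.linalg.eigh(adj)               # eigenvalues ascending
        return np.sign(vecs[:, -2])

    rng = np.random.default_rng(0)
    n, a, b = 2000, 12.0, 2.0    # (a-b)^2 = 100 > 2(a+b) = 28: above threshold
    adj, truth = sample_sbm(n, a, b, rng)
    guess = spectral_labels(adj)
    overlap = abs(np.mean(guess * truth))  # 0 = chance, 1 = perfect recovery
    print(f"overlap with planted partition: {overlap:.2f}")

A plain adjacency-matrix spectral method, as here, works comfortably above the threshold; succeeding all the way down to the threshold requires more delicate tools, such as the spectrum of the non-backtracking matrix, which is precisely the gap between information-theoretic limits and efficient algorithms that aims (1) and (2) address.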
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.