While modern datasets are very large, the amount of information per variable is often relatively small. This includes datasets from genomics, social networks, and many applications in machine learning and artificial intelligence. For instance, in genomics we often track hundreds of thousands of genes, but only have a few hundred independent samples for each one. Similarly, online social networks are massive, but the structure of friendships only gives us a relatively small amount of data per individual. This kind of data is called "high-dimensional", and poses new challenges for mathematics, statistics, and computer science, especially when (as with all real data) they are noisy or incomplete. This project will identify exactly when and how it is mathematically possible to find patterns in these massive but noisy datasets, giving scientists across many fields a useful guide to how much data they need to draw reliable conclusions, and to develop new algorithms that will solve modern data science problems efficiently and optimally.

Through the study of community detection, noisy graph isomorphism, and matrix/tensor factorization, this project will develop a general framework to 1) locate the information-theoretic limit below which the observation is too noisy to detect the underlying pattern, or even to tell if a pattern exists; 2) devise efficient algorithms that succeed all the way down to the lowest possible signal-to-noise ratio; 3) prove that important classes of algorithms need super-polynomial time in certain hard regimes.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1838251
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2018-10-01
Budget End
2021-09-30
Support Year
Fiscal Year
2018
Total Cost
$737,645
Indirect Cost
Name
Santa Fe Institute
Department
Type
DUNS #
City
Santa Fe
State
NM
Country
United States
Zip Code
87501