Advanced algorithms are an increasingly powerful tool for extracting information from the vast amounts of data gathered over the Internet, by smartphones, sensor networks, or high-throughput scientific studies. As these methods become ubiquitous, it is crucial to understand their full potential. What kind of information can we hope to extract from a certain type of data? Conversely, how much data should we accumulate in order to be able to infer a certain piece of information? What is the bottleneck that prevents us from extracting more information? These questions have been studied within classical statistics, but modern applications pose entirely new challenges, and classical concepts are only partially useful. In particular, computational resources become a crucial bottleneck for modern datasets. In many cases, although the data contain, in principle, the information of interest, finding it is a needle-in-a-haystack problem that cannot be solved on human timescales. This project aims at characterizing these fundamental limitations in several central problems, and at developing algorithms that achieve those limits.
Both information theory and complexity theory fall short of capturing the fundamental limitations of statistical learning tasks. This project follows a different approach, which aims at analyzing broad classes of algorithms and drawing connections between their behaviors. More precisely, the project considers three such classes that together encompass most algorithms in use today: empirical risk minimization; semidefinite programming hierarchies; and local algorithms. The focus is on two concrete statistical estimation problems that are relevant to a number of applications: group synchronization on graphs, and non-linear high-dimensional regression and classification. In these and analogous problems, the behavior of seemingly different types of algorithms is often surprisingly similar. Understanding the origin of this similarity and its implications is a key focus of this research.
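To make the terminology concrete, the following is a minimal toy sketch (not the project's actual methodology) of how empirical risk minimization, implemented by a simple local algorithm, can be applied to a synchronization problem: each node of a graph carries an unknown sign, we observe noisy pairwise products, and we estimate the signs by (relaxed) minimization of a quadratic empirical risk. The problem size, noise level, step size, and iteration count are arbitrary illustrative choices.

```python
# Toy sketch: Z/2 synchronization on a complete graph via relaxed empirical
# risk minimization with projected gradient ascent (a simple local algorithm).
# All parameters below are illustrative, not taken from the project.
import numpy as np

rng = np.random.default_rng(0)

n, sigma = 200, 0.8                      # number of nodes and noise level (assumed)
x = rng.choice([-1.0, 1.0], size=n)      # ground-truth sign of each node

# Noisy pairwise observations Y_ij ~ x_i * x_j / n + symmetric Gaussian noise.
noise = rng.normal(scale=sigma, size=(n, n))
Y = np.outer(x, x) / n + (noise + noise.T) / np.sqrt(2 * n)

# Relaxed ERM: maximize <s, Y s> over the sphere ||s||^2 = n, dropping the
# +/-1 constraint, by projected gradient ascent from a random start.
s = rng.normal(size=n)
s *= np.sqrt(n) / np.linalg.norm(s)
for _ in range(200):
    s = s + 0.5 * Y @ s                  # gradient step on the quadratic objective
    s *= np.sqrt(n) / np.linalg.norm(s)  # project back onto the sphere

estimate = np.sign(s)
overlap = abs(estimate @ x) / n          # accuracy up to a global sign flip
print(f"overlap with ground truth: {overlap:.2f}")
```

The quality of the estimate is measured by the overlap with the ground truth only up to a global sign flip, since the observations are invariant under flipping all signs simultaneously; this invariance is what makes the problem a group synchronization problem.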