Discrete structures like networks, clusters and ranking are observed in many real-life applications including brain networks, protein clusters, and portfolio selection. This project will develop a unified statistical framework for inferring these unknown discrete structures from large-scale datasets in biology, neuroscience, and finance. The replicability crisis on whether statistical inference for scientific discoveries can be trusted has drawn attention from both the scientific community and the public. A new field, "combinatorial inference", developed by the PI aims to fill the gap between various existing methods designed for continuous quantities and the lack of valid reproducibility assessment for discrete ones. This project will also train the next generation of data scientists to acquire statistical and computational skills for large-scale scientific data analysis.

This project aims to develop a unified inference framework to assess uncertainty (e.g. constructing confidence intervals, testing hypotheses and controlling the false discovery rates) when inferring discrete structures include networks, hypergraphs, clustering and ranking. Three types of problems will be considered: (1) Testing the hypothesis that the discrete structure has certain combinatorial properties (e.g., the network is connected, or the graph is triangle-free); (2) Constructing confidence sets covering the discrete quantities with given significance level (e.g., the maximum degree of a graph, the elements belong to a same cluster, top k ranking items); (3) Screening discrete structures of interest (e.g., the cycles, hubs and cliques in a graph) with the false discovery rate control. Classical inference mainly focuses on testing hypotheses and constructing confidence intervals for continuous parameters where analytical methods and theory can be directly applied. On the other hand, exiting research on discrete structures in statistical models mainly focuses on estimation and lacks systematic inferential methods for uncertainty assessment. This project seeks to advance the frontiers of modern statistical inference in four directions: (a) Methodology: Developing efficient methods for hypothesis testing, confidence interval construction, and false discovery controlling procedures; (b) Theory: Developing new probabilistic tools including non-asymptotic concentration inequalities and limiting theory and universality phenomena for combinatorial random quantities; (c) Computation: Computationally efficient algorithms to implement the combinatorial inferential methods for large-scale statistical models; and (d) Fundamental limits: Developing new combinatorial information-theoretic lower bounds to justify the optimality of the proposed inferential methods.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Application #
Program Officer
Gabor Szekely
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Harvard University
United States
Zip Code