Network data have become common in a wide range of fields, and a large and diverse community of researchers has studied various aspects of networks, yet statistical methods are rarely applied. This project proposes new theory, methodology, and algorithms that take a principled statistical approach to these problems, assess uncertainty, and establish conditions for desirable properties such as consistency. The focus is primarily on community structure, which is common in real networks and whose discovery is a fundamental problem in network analysis. New pseudo-likelihood algorithms are proposed for fitting the block model for networks, as well as several generalizations that allow for non-uniform degree distributions within blocks, removing the main limitation of the classic block model. The pseudo-likelihood based on aggregated data substantially speeds up computation, allowing these models to be fit to larger and sparser networks than was previously possible. The asymptotic distribution of criteria used for community detection is also studied, leading to significance tests for community structure, consistency conditions, and asymptotically correct partition thresholds, all of which have important practical implications. New, more robust criteria that are consistent under weaker conditions are also proposed. The proposal also develops a formal non-parametric test for comparing two networks, a problem that arises frequently in practice but is currently addressed only through informal comparisons of summary statistics. Finally, covariates on nodes and edges are incorporated into the models and used to predict unobserved links in the networks. Many of the proposed methods provide the first statistical solutions to the corresponding network problems.
The development of statistical methods for community detection in networks contributes to core statistical theory and methodology and has a direct impact on the interdisciplinary field of network analysis and the study of complex networks. Applications of these methods are widespread, covering areas as diverse as infectious disease modeling, national security, communications, sociology, and genomics. The new statistical tools proposed take a more formal, rigorous approach and have the potential to change how many scientists approach network analysis.