With rapid advances in data collection technology, relational data is becoming increasingly important in modern science. Broadly speaking, relational data records interactions and dependencies among actors in a population of interest. A typical example is network data, such as social networks, the World Wide Web, and terrorist networks, where each actor is represented by a node in the network and an interaction between two actors is represented by the presence of an edge between the corresponding nodes. Another common form of relational data is covariance and correlation data, which summarizes pairwise dependence among actors, such as gene-gene co-expression, functional correlation in brain imaging, and spatial correlation in atmospheric and oceanographic measurements. Such data sets often contain important structures that can provide key insights into the population of interest. For example, the population of actors in a network data set may be divided into several communities with different connectivity patterns; the population in a correlation data set may contain a few important actors that account for most of the observed variability. However, the high dimensionality and complex dependence structure of these data sets make recovering these hidden structures a challenging statistical problem.

This research project aims to advance the theory and methodology of statistical inference for network and covariance data using spectral and principal components analysis. These two topics are brought together and studied using a novel set of tools recently developed in random matrix theory, spectral analysis, and empirical process theory. The project will investigate three topics. The first is a better understanding and refinement of community recovery in sparse network models using spectral clustering, one of the most popular methods in the literature and in practice. The second is network community detection in a statistical minimax framework, including information-theoretic lower bounds quantified by a comprehensive collection of model parameters, and optimal estimation procedures that achieve those lower bounds. The third is goodness-of-fit testing for general sparse principal components analysis models, where adaptive procedures will be developed using a detection boundary framework, and the high-dimensionality challenge will be tackled by considering regular alternatives such as Sobolev ellipsoids.
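As a rough illustration of the first topic, the sketch below simulates a small two-community network and splits its nodes by the sign of the second leading eigenvector of the adjacency matrix, which is one simple form of spectral clustering. It is not the project's procedure: the two-block model, the parameter values (`n=60`, `p_in=0.6`, `p_out=0.05`), and all function names are assumptions chosen for illustration, and the network is deliberately dense and well separated, whereas the sparse regime targeted by the project is where refinements become necessary.

```python
import random


def simulate_sbm(n, p_in, p_out, seed=0):
    """Symmetric 0/1 adjacency matrix of a two-block stochastic block model.

    Hypothetical toy setup: the first half of the nodes form community 0,
    the second half community 1.
    """
    rng = random.Random(seed)
    labels = [0 if i < n // 2 else 1 for i in range(n)]
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1.0
    return adj, labels


def matvec(adj, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in adj]


def normalize(v):
    """Rescale v to unit Euclidean norm."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def top_eigenvector(adj, orthogonal_to=None, iters=300, seed=1):
    """Power iteration for the eigenvector of largest absolute eigenvalue.

    If orthogonal_to is given, that direction is projected out at every
    step, so the iteration targets the next eigenvector instead.
    """
    rng = random.Random(seed)
    v = normalize([rng.random() - 0.5 for _ in adj])
    for _ in range(iters):
        v = matvec(adj, v)
        if orthogonal_to is not None:
            dot = sum(a * b for a, b in zip(v, orthogonal_to))
            v = [a - dot * b for a, b in zip(v, orthogonal_to)]
        v = normalize(v)
    return v


def spectral_bisect(adj):
    """Assign each node to a community by the sign of the second eigenvector."""
    first = top_eigenvector(adj)
    second = top_eigenvector(adj, orthogonal_to=first)
    return [0 if x >= 0 else 1 for x in second]


def accuracy(estimate, truth):
    """Fraction of correctly labelled nodes, up to swapping the two labels."""
    agree = sum(e == t for e, t in zip(estimate, truth)) / len(truth)
    return max(agree, 1 - agree)


adj, truth = simulate_sbm(n=60, p_in=0.6, p_out=0.05)
estimate = spectral_bisect(adj)
print("fraction of nodes recovered:", accuracy(estimate, truth))
```

Accuracy is measured up to a label swap because community labels are only identifiable up to permutation. Lowering `p_in` and `p_out` toward the sparse regime is a simple way to see where this plain version starts to degrade.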

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1407771
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2014-08-01
Budget End: 2017-07-31
Support Year:
Fiscal Year: 2014
Total Cost: $120,000
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213