The investigators and their students study the development, implementation, and application of iterative search procedures for unsupervised exploratory data analysis. In particular, they develop statistically principled procedures for discovering patterns in high-dimensional data, including biclustering and correlation mining of genomic data, and community detection in complex networks arising in computational sociology and public policy. Complementing the methodological component of the research, the investigators and their students also develop general theoretical tools to analyze iterative data mining procedures and the properties of their associated local optima. They develop probabilistic tools, including new variants of Stein's method for normal approximation and new Gaussian comparison theorems, to understand asymptotic properties of typical local optima, and the dependence of these optima under different assumptions on the underlying signal, beginning with the null setting in which only noise is present. Their research is carried out in the context of ongoing collaborations with UNC faculty in the Medical School, and in the Departments of Genetics, Public Policy, and Mathematics.
The broad subject of the proposal is the development, theoretical analysis, and application of exploratory methods for large data sets. By exploratory methods, we mean those that search large data sets for significant patterns or configurations that may be of organizational or scientific interest. Examples include patterns that may distinguish types of a disease, patterns that help target a drug or assess its efficacy, and patterns that identify, among a large number of people, a smaller community whose members frequently exchange text messages. In many cases, a numerical score is used to assess the potential importance of a pattern, and attention then turns to finding a pattern with a large score. Our primary interest is in search procedures that begin with a candidate pattern, then search for closely related patterns in the data that have a higher score, repeating this step until they reach a pattern where no further (local) improvements are possible. Procedures of this sort are routinely applied in large data problems where finding the "best" pattern (the pattern with the largest score) is computationally prohibitive. We are developing and applying new, statistically based search procedures for several important tasks arising in the exploratory analysis of large data sets, including data mining and community detection. At the same time, we are developing fundamental theory to justify and inform the application of the iterative search procedures. Our work is being carried out in the context of ongoing collaborations with UNC faculty in the Medical School, and in the Departments of Genetics, Public Policy, and Mathematics.
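The kind of iterative search described above can be illustrated with a minimal sketch. It is an illustration under assumed choices, not the investigators' actual procedure: the "pattern" is taken to be a k-by-l submatrix of a data matrix, the "score" is the sum of its entries, and a "closely related pattern" is one obtained by re-choosing the rows with the columns held fixed, or the columns with the rows held fixed. The function names `submatrix_score` and `local_search` are hypothetical.

```python
import random


def submatrix_score(data, rows, cols):
    """Assumed score: the sum of the entries in the chosen rows and columns."""
    return sum(data[i][j] for i in rows for j in cols)


def local_search(data, k, l, rng):
    """Alternate row and column updates until the score stops improving."""
    n_rows, n_cols = len(data), len(data[0])
    # Candidate pattern: a random set of l columns.
    cols = sorted(rng.sample(range(n_cols), l))
    rows = []
    best = float("-inf")
    while True:
        # With the columns fixed, the k rows with the largest sums are optimal.
        by_row_sum = sorted(range(n_rows),
                            key=lambda i: -sum(data[i][j] for j in cols))
        new_rows = sorted(by_row_sum[:k])
        # With those rows fixed, the l columns with the largest sums are optimal.
        by_col_sum = sorted(range(n_cols),
                            key=lambda j: -sum(data[i][j] for i in new_rows))
        new_cols = sorted(by_col_sum[:l])
        score = submatrix_score(data, new_rows, new_cols)
        if score <= best:
            # No further local improvement: (rows, cols) is a local optimum.
            return rows, cols, best
        rows, cols, best = new_rows, new_cols, score


# Toy data: a 6-by-6 matrix of zeros with a planted 2-by-2 block of fives.
data = [[0] * 6 for _ in range(6)]
for i in (1, 3):
    for j in (2, 4):
        data[i][j] = 5

rows, cols, score = local_search(data, 2, 2, random.Random(0))
print(rows, cols, score)  # prints: [1, 3] [2, 4] 20
```

Each pass can only raise the score, and there are finitely many submatrices, so the loop terminates. The pattern it returns is a local optimum in the sense that neither the rows alone nor the columns alone can be changed to increase the score. In this toy example the planted block is recovered, but in general a local optimum need not be the best pattern overall.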