A fundamental problem in analyzing big data is to extract and represent the relations among the huge number of variables in a dataset. For example, in a genomic dataset, one may want to find out the dependence among a large number of genetic variations and various disease states. The Bayesian network is a commonly used class of mathematical models to represent such complex relations among a collection of variables, with wide applications in many scientific fields, ranging from the biomedical sciences to the social sciences. The goal of this project is to develop statistical and machine learning methods to construct Bayesian networks from big data, where the datasets may contain thousands to millions of variables. This is a challenging problem, particularly for large networks, as seen from the fact that state-of-the-art methods can barely handle thousands of variables. In this project, a novel divide-and-conquer approach will be developed and implemented as open-source packages for public use. The PIs will also study the theoretical properties of key components of this approach. Through seminar organization and educational activities in both graduate and undergraduate training, the cutting-edge research in this project will be communicated immediately to a much broader audience.
The proposed approach consists of three main components: Partition, Estimation and Fusion (PEF). In the partition stage, spectral clustering will be embedded into an iterative subsampling approach to efficiently group variables into clusters. In the estimation stage, a few new methods will be developed to estimate the structure of a Bayesian network for each cluster of nodes, which serves as a subgraph of the big network. These methods include convex relaxations for permutations, fast algorithms for large-scale regularized estimation of the parameters of a Bayesian network, and novel formulations for discrete data. The final fusion stage will merge subgraphs into one big Bayesian network via a new method based on multiple-response sparse regression. Rigorous analysis of the PEF learning strategy for Bayesian networks under high-dimensional scaling will be conducted to provide theoretical guarantees for the methods and the algorithms.