A fundamental problem in analyzing big data is to extract and represent the relations among the large number of variables in a dataset. For example, in a genomic dataset, one may want to identify the dependencies among a large number of genetic variations and various disease states. Bayesian networks are a commonly used class of mathematical models for representing such complex relations among a collection of variables, with wide applications in many scientific fields, ranging from the biomedical sciences to the social sciences. The goal of this project is to develop statistical and machine learning methods to construct Bayesian networks from big data, where the datasets may contain thousands to millions of variables. This is a challenging problem, particularly for large networks: state-of-the-art methods can barely handle thousands of variables. In this project, a novel divide-and-conquer approach will be developed and implemented as open-source packages for public use. The PIs will also study the theoretical properties of key components of this approach. Through seminar organization and educational activities in both graduate and undergraduate training, the cutting-edge research in this project will be communicated promptly to a much broader audience.

The proposed approach consists of three main components: Partition, Estimation and Fusion (PEF). In the partition stage, spectral clustering will be embedded into an iterative subsampling approach to efficiently group variables into clusters. In the estimation stage, several new methods will be developed to estimate the structure of a Bayesian network for each cluster of nodes, which serves as a subgraph of the full network. These methods include convex relaxations for permutations, fast algorithms for large-scale regularized estimation of the parameters of a Bayesian network, and novel formulations for discrete data. The final fusion stage will merge the subgraphs into one large Bayesian network via a new method based on multiple-response sparse regression. Rigorous analysis of the PEF learning strategy for Bayesian networks under high-dimensional scaling will be conducted to provide theoretical guarantees for the methods and the algorithms.
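
To make the partition stage more concrete, the following is a minimal sketch of grouping variables by spectral clustering. It is an illustration under stated assumptions, not the project's method: it uses the absolute sample correlation as the affinity between variables (an assumption), performs only a two-way split based on the second eigenvector of the normalized graph Laplacian, and omits the iterative subsampling described above. The names `simulate_two_blocks` and `spectral_bipartition` are hypothetical.

```python
import numpy as np


def simulate_two_blocks(n_samples=500, block_size=4, seed=0):
    """Toy data (hypothetical): two groups of variables, each group driven
    by its own latent factor, so variables correlate mainly within a group."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, 2))
    noise = 0.5 * rng.normal(size=(n_samples, 2 * block_size))
    # The first `block_size` columns follow latent factor 0, the rest factor 1.
    factor_of_column = np.repeat([0, 1], block_size)
    return latent[:, factor_of_column] + noise


def spectral_bipartition(data):
    """Split the columns (variables) of `data` into two clusters.

    Assumption: affinity between two variables is the absolute value of
    their sample correlation. The split follows the sign of the eigenvector
    belonging to the second-smallest eigenvalue of the symmetric normalized
    Laplacian I - D^{-1/2} A D^{-1/2}.
    """
    affinity = np.abs(np.corrcoef(data, rowvar=False))
    np.fill_diagonal(affinity, 0.0)

    # Guard against isolated variables with zero total affinity.
    degree = np.maximum(affinity.sum(axis=1), 1e-12)
    inv_sqrt_degree = 1.0 / np.sqrt(degree)

    laplacian = np.eye(len(degree)) - (
        inv_sqrt_degree[:, None] * affinity * inv_sqrt_degree[None, :]
    )

    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector of the second-smallest eigenvalue.
    _, eigenvectors = np.linalg.eigh(laplacian)
    return (eigenvectors[:, 1] > 0).astype(int)


if __name__ == "__main__":
    labels = spectral_bipartition(simulate_two_blocks())
    print(labels)
```

In a full pipeline, each resulting cluster would then be passed to a structure-learning routine (the estimation stage) and the learned subgraphs merged afterwards (the fusion stage); neither step is shown here.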

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1546098
Program Officer: Sylvia Spengler
Project Start:
Project End:
Budget Start: 2015-10-01
Budget End: 2020-09-30
Support Year:
Fiscal Year: 2015
Total Cost: $919,305
Indirect Cost:
Name: University of California Los Angeles
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095