The research proposed in this project is motivated by the following problem. In many genetic studies, in addition to gene expression data, other types of data are collected from the same individuals. The problem is how to make use of this additional information when constructing gene networks. The investigators formulate this problem as a conditional Gaussian graphical model (CGGM), in which the external variables are incorporated as predictors. They propose an estimation procedure for this model that combines reproducing kernel Hilbert space methods with lasso-type regularization. The former is used to construct a model-free estimate of the conditional covariance matrix, and the latter is used to derive a sparse estimator of the conditional precision matrix, whose pattern of zero entries corresponds to a graph that describes the gene network. They propose to study the asymptotic properties of this estimator, to introduce methods for determining the tuning constants, and to develop standardized and openly accessible computer programs for this model. Furthermore, the investigators propose to extend the CGGM in two directions. First, they propose to relax the Gaussian assumption by applying a copula transformation to the residuals and then using pseudo-likelihood to estimate conditional correlations. These are then subjected to lasso-type regularization to yield a sparse estimator of the precision matrix. The second direction is the development of a sufficient graphical model, which is a mechanism to simultaneously reduce the dimension of the predictor and estimate the graphical structure of the response.
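The two-step idea described above (an RKHS fit to remove the effect of the predictors, followed by lasso-type regularization of the precision matrix) can be illustrated with a minimal sketch. This is not the investigators' estimator: it assumes a Gaussian kernel, uses kernel ridge regression for the RKHS step, treats the residual covariance as constant in the predictors, and solves the lasso step with a generic ADMM graphical lasso. All function names and default parameters (`bandwidth`, `ridge`, `lam`, `rho`) are hypothetical choices made for illustration.

```python
import numpy as np


def rbf_kernel(X, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth**2))


def kernel_residuals(X, Y, bandwidth=1.0, ridge=1e-2):
    """Residuals of Y after kernel ridge regression on X (the RKHS step)."""
    n = X.shape[0]
    K = rbf_kernel(X, bandwidth)
    fitted = K @ np.linalg.solve(K + n * ridge * np.eye(n), Y)
    return Y - fitted


def soft_threshold(A, t):
    """Elementwise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)


def graphical_lasso_admm(S, lam, rho=1.0, n_iter=200):
    """Sparse precision matrix from covariance S (L1 penalty on off-diagonals)."""
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    off_diag = ~np.eye(p, dtype=bool)
    for _ in range(n_iter):
        # Theta-update: closed form via an eigendecomposition.
        w, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta_eig = (w + np.sqrt(w**2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * theta_eig) @ Q.T
        # Z-update: soft-threshold the off-diagonal entries only.
        A = Theta + U
        A = 0.5 * (A + A.T)  # enforce exact symmetry
        Z = A.copy()
        Z[off_diag] = soft_threshold(A[off_diag], lam / rho)
        # Scaled dual update.
        U = U + Theta - Z
    return Z


def fit_conditional_graph(X, Y, lam=0.1, bandwidth=1.0, ridge=1e-2):
    """Return (precision estimate, boolean adjacency) for Y given X."""
    R = kernel_residuals(X, Y, bandwidth, ridge)
    R = R - R.mean(axis=0)
    S = R.T @ R / R.shape[0]  # residual covariance (assumed constant in X)
    Theta = graphical_lasso_admm(S, lam)
    adjacency = (Theta != 0) & ~np.eye(Theta.shape[0], dtype=bool)
    return Theta, adjacency


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(150, 1))           # external predictor
    e = rng.normal(size=(150, 3))
    shift = np.sin(X[:, 0])                 # nonlinear effect of X
    Y = np.column_stack([e[:, 0] + shift, e[:, 0] + 0.5 * e[:, 1] + shift, e[:, 2]])
    Theta, adjacency = fit_conditional_graph(X, Y, lam=0.05)
    print(adjacency.astype(int))
```

In this sketch the nonzero off-diagonal entries of the returned matrix define the edges of the estimated graph. The penalty `lam` plays the role of a tuning constant: larger values yield sparser graphs, and how to choose it is one of the questions the proposal sets out to study.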

High-throughput technologies that enable researchers to collect and monitor information at the genome level have revolutionized the field of biology in the past fifteen years. These technologies offer an unprecedented amount and diversity of data that reveal different aspects of biological processes. At the same time, they present many statistical and computational challenges that cannot be addressed by traditional statistical methods. In current genomics research it has become increasingly clear that statistical analysis based on individual genes may incur a loss of information on the biological process under study. For example, a widely known study on identifying genetic patterns of diabetic patients shows that no single gene stood out statistically as responsible for the patterns, and yet clear signals emerged when genes were analyzed in groups. Motivated by this observation, greater attention has been paid to networks of genes. The investigators propose a class of new statistical methods, called conditional graphical models, for constructing gene networks that can take into account a set of covariates. They also plan to develop theoretical properties and computer programs for the proposed methods. Although their inquiry began with gene networks, the investigators envision conditional graphical models having broad applications beyond genomics, such as in predicting asset returns and in studying social networks, which are becoming ever more prevalent in the age of the Internet.

Project Report

Current genomics research indicates that statistical analysis based on individual genes may incur a loss of information on the biological process under study. Better results can be derived from analysis based on groups of genes, or gene networks. An informative characterization of a gene network is the global Markov property, which can be inferred using statistical graphical models. In many genetic studies, however, in addition to gene expression data, other types of data are collected from the same individuals. A main achievement of this project is the development of a new class of statistical graphical models that can effectively make use of this additional information. We introduced a flexible conditional Gaussian graphical model, in which the external variables are incorporated as predictors. We proposed an estimation procedure for this model that combines reproducing kernel Hilbert space methods with lasso-type regularization. In subsequent work, we broadened the conditional graphical model so that it can incorporate genetic data from different sources. Another important outcome of this project is the development of nonparametric additive graphical models, which relax the restrictive Gaussian assumption, or copula Gaussian assumption, of the graphical models currently in use. Conditional independence is the basis of current graphical models. It is a complicated statistical relation that is difficult to model. The modeling of conditional independence is greatly simplified under the Gaussian assumption, but that assumption is too strong and unrealistic for many of the network data we encounter in practice. Relaxing this assumption is a very challenging problem and a focal point of much recent research. We developed a new theory of additive conditional independence, which shares all the important properties of conditional independence without using the Gaussian assumption.
Equipped with this new theory, we developed a flexible graphical model that can be applied to non-Gaussian network data and that is relatively easy to program and compute. Apart from these accomplishments on graphical models and statistical networks, we have also made significant advances in sufficient dimension reduction, which is an integral part of this project. Sufficient dimension reduction is a powerful body of theory and methodology for handling high-dimensional data, such as those that arise in genetics and data mining. It is playing an increasingly important role in the age of big data. One of the outcomes of this project is the extension of sufficient dimension reduction to the nonlinear case. This greatly expands the scope and application of sufficient dimension reduction, puts on a rigorous footing and unifies many recent results developed in statistics and machine learning, and provides a general theoretical platform for developing new methods for handling large and high-dimensional data.
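The report notes that the modeling of conditional independence is greatly simplified under the Gaussian assumption. The standard characterization behind that remark can be written out as follows. The symbols used here ($Y$, $X$, $\Sigma$, $\Theta$, $E$) are notation assumed for illustration and do not appear in the report, and the constant conditional covariance is an assumption of this sketch.

```latex
% Y = (Y_1, ..., Y_p): responses (e.g. gene expressions); X: external predictors.
% Assumption for this sketch: Y | X = x ~ N(mu(x), Sigma), with Theta = Sigma^{-1}.
\[
  Y_i \perp\!\!\!\perp Y_j \mid \{Y_k : k \neq i, j\},\, X
  \quad\Longleftrightarrow\quad
  \Theta_{ij} = 0,
  \qquad \Theta = \Sigma^{-1}.
\]
% The graph is then read off from the nonzero off-diagonal entries of Theta:
\[
  E = \{(i, j) : i \neq j,\ \Theta_{ij} \neq 0\}.
\]
```

Outside the Gaussian (or copula Gaussian) setting this equivalence is not available in general, which is why relaxing the assumption calls for a different notion, such as the additive conditional independence described in the report.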

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer: Gabor J. Szekely
Pennsylvania State University
University Park
United States