The research proposed in this project is motivated by the following problem. In many genetic studies, other types of data are collected from the same individuals in addition to gene expression data. The problem is how to make use of this additional information when constructing gene networks. The investigators formulate this problem as a conditional Gaussian graphical model (CGGM), in which the external variables are incorporated as predictors. They propose an estimation procedure for this model that combines reproducing kernel Hilbert space methods with lasso-type regularization. The former is used to construct a model-free estimate of the conditional covariance matrix, and the latter is used to derive a sparse estimator of the conditional precision matrix, whose pattern of zero entries corresponds to a graph that describes the gene network. They propose to study the asymptotic properties of the estimator, to introduce methods for determining the tuning constants, and to develop standardized and openly accessible computer programs for this model. Furthermore, the investigators propose to extend the CGGM in two directions. First, they propose to relax the Gaussian assumption by applying a copula transformation to the residuals and then using pseudo-likelihood to estimate the conditional correlations. These are then subjected to lasso-type regularization to yield a sparse estimator of the precision matrix. The second direction is the development of a sufficient graphical model, a mechanism that simultaneously reduces the dimension of the predictor and estimates the graphical structure of the response.
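
The two-stage idea described above can be illustrated with a rough sketch. This is not the investigators' implementation: it assumes that a Gaussian-kernel ridge regression of the responses on the predictors is an acceptable stand-in for the reproducing kernel Hilbert space estimate of the conditional covariance matrix, and that a generic ADMM solver for the lasso-penalized Gaussian likelihood is an acceptable stand-in for the sparse precision estimator. The function names, the bandwidth, the ridge constant, the penalty `lam`, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(X, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def conditional_covariance(Y, X, bandwidth=1.0, ridge=1e-2):
    """Covariance of Y after removing a kernel ridge fit on X.

    Assumption: this residual covariance stands in for the
    model-free conditional covariance estimate in the abstract.
    """
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    K = rbf_kernel(X, bandwidth)
    coef = np.linalg.solve(K + n * ridge * np.eye(n), Yc)
    resid = Yc - K @ coef
    resid = resid - resid.mean(axis=0)
    return resid.T @ resid / n

def graphical_lasso_admm(S, lam, rho=1.0, n_iter=500):
    """Minimize tr(S Theta) - log det(Theta) + lam * sum_{i != j} |Theta_ij|
    with a standard ADMM scheme; returns the sparse iterate."""
    p = S.shape[0]
    Z = np.eye(p)
    U = np.zeros((p, p))
    off_diag = ~np.eye(p, dtype=bool)
    for _ in range(n_iter):
        # Theta-update: closed form through an eigendecomposition.
        w, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta_eig = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * theta_eig) @ Q.T
        # Z-update: soft-threshold the off-diagonal entries only.
        A = Theta + U
        Z = A.copy()
        Z[off_diag] = np.sign(A[off_diag]) * np.maximum(
            np.abs(A[off_diag]) - lam / rho, 0.0
        )
        # Scaled dual update.
        U = U + Theta - Z
    return Z

# Simulated example (hypothetical data): five responses share a nonlinear
# effect of the first predictor, and their noise has a chain-structured
# precision matrix.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, 2))
theta_true = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
noise = rng.multivariate_normal(np.zeros(p), np.linalg.inv(theta_true), size=n)
Y = noise + 2.0 * np.sin(X[:, [0]])

S = conditional_covariance(Y, X)              # stage 1: conditional covariance
theta_hat = graphical_lasso_admm(S, lam=0.1)  # stage 2: sparse precision
graph = np.abs(theta_hat) > 1e-8              # nonzero off-diagonals = edges
```

In this sketch the estimated network is read from the nonzero off-diagonal entries of `theta_hat`. The value of `lam` is arbitrary here; the methods for determining tuning constants mentioned in the abstract would take its place.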

High-throughput technologies that enable researchers to collect and monitor information at the genome level have revolutionized the field of biology in the past fifteen years. These technologies offer data of unprecedented volume and diversity that reveal different aspects of biological processes. At the same time, they present many statistical and computational challenges that cannot be addressed by traditional statistical methods. In current genomics research it has become increasingly clear that statistical analysis based on individual genes may lose information about the biological process under study. For example, a widely known study on identifying genetic patterns in diabetic patients showed that no single gene stood out statistically as responsible for the patterns, yet clear signals emerged when genes were analyzed in groups. Motivated by this observation, researchers have paid greater attention to networks of genes. The investigators propose a class of new statistical methods, called conditional graphical models, for constructing gene networks that take into account a set of covariates. They also plan to develop theoretical properties and computer programs for the proposed methods. Although their inquiry began with gene networks, the investigators envision conditional graphical models having broad applications beyond genomics, such as predicting asset returns and studying social networks, which are becoming ever more prevalent in the age of the Internet.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1106738
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2011-07-01
Budget End: 2015-06-30
Support Year:
Fiscal Year: 2011
Total Cost: $80,000
Indirect Cost:
Name: Yale University
Department:
Type:
DUNS #:
City: New Haven
State: CT
Country: United States
Zip Code: 06520