The broad, long-term objective of this project is to develop novel statistical methods and computational tools for statistical and probabilistic modeling of large-scale data spanning multiple types of genomic measurements, motivated by important biological questions and experiments. New high-throughput technologies and next-generation sequencing are generating many types of very high-dimensional genomic and proteomic data, along with metadata (e.g., network and pathway databases), with the goal of a systems-level understanding of complex phenotypes. As the amount and complexity of the data increase and the questions being addressed become more sophisticated, valid statistical and biological inference requires analysis methods that can integrate these genomic data while incorporating information about gene function and pathways into the analysis of numerical vector and matrix data.
The specific aims of the current project are to develop new statistical models and methods for integrative analysis of genomic data in the context of pathways and networks. Motivated by the analysis of genetic genomics data and diverse cancer genomic data, the first aim is to develop novel statistical methods for estimating a genotype-adjusted precision matrix for a set of genes at the transcriptional level. The resulting regression coefficient matrix and sparse precision matrix provide important information on gene regulation after the cis- and trans-genetic effects on gene expression have been adjusted for.
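The idea of a genotype-adjusted precision matrix can be illustrated with a minimal sketch. It is not the penalized estimator proposed in the first aim: it uses ordinary least squares and a plain matrix inverse, which is only reasonable when the sample size is much larger than the number of genes and SNPs. The function names, the simulated data, and the effect sizes are all hypothetical.

```python
import numpy as np


def genotype_adjusted_precision(expr, geno):
    """Unpenalized sketch: regress expression on genotypes, then invert
    the covariance of the residuals.

    expr: (n, p) array of expression levels for p genes
    geno: (n, q) array of genotypes for q SNPs
    Returns the (q, p) genotype-effect matrix and the (p, p) precision matrix.
    """
    n = expr.shape[0]
    design = np.column_stack([np.ones(n), geno])  # intercept + genotypes
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    resid = expr - design @ coef
    resid_cov = resid.T @ resid / (n - design.shape[1])
    return coef[1:], np.linalg.inv(resid_cov)


def simulate_expression(n=2000, seed=0):
    """Hypothetical toy data: one SNP drives gene 1 and gene 3;
    gene 1 regulates gene 2; gene 3 has no direct link to genes 1 or 2."""
    rng = np.random.default_rng(seed)
    snp = rng.integers(0, 3, size=n).astype(float)  # genotype coded 0/1/2
    gene1 = snp + rng.normal(size=n)
    gene2 = 0.8 * gene1 + rng.normal(size=n)
    gene3 = snp + rng.normal(size=n)
    return np.column_stack([gene1, gene2, gene3]), snp.reshape(-1, 1)


expr, geno = simulate_expression()
effects, adjusted = genotype_adjusted_precision(expr, geno)
unadjusted = np.linalg.inv(np.cov(expr, rowvar=False))

print("estimated genotype effects:", np.round(effects, 2))
print("adjusted precision (gene1, gene3):", round(adjusted[0, 2], 3))
print("unadjusted precision (gene1, gene3):", round(unadjusted[0, 2], 3))
```

In this toy setting the shared SNP induces a spurious conditional dependence between gene 1 and gene 3 when genotype is ignored, and that entry shrinks toward zero once the genetic effect is regressed out. In the high-dimensional setting the proposal targets, both steps would need sparsity-inducing penalties.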
The second aim is to develop high-dimensional instrumental variable regression for eQTL data analysis in order to identify potential causal genes for a phenotype, with genome-wide genotypes serving as instrumental variables.
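The role of genotypes as instrumental variables can be sketched with classical two-stage least squares in the simplest case of one SNP, one gene, and one phenotype. This is a low-dimensional illustration under assumed conditions (the SNP affects the phenotype only through the gene), not the high-dimensional method proposed in the aim; all names, the simulated confounder, and the effect sizes are hypothetical.

```python
import random


def ols(x, y):
    """Slope and intercept of the simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_stage_least_squares(genotype, expression, phenotype):
    """Stage 1: predict expression from genotype.
    Stage 2: regress phenotype on the predicted expression."""
    slope1, intercept1 = ols(genotype, expression)
    predicted = [intercept1 + slope1 * g for g in genotype]
    slope2, _ = ols(predicted, phenotype)
    return slope2


def simulate(n=5000, causal_effect=0.5, seed=1):
    """Hypothetical data with an unobserved confounder of expression and phenotype."""
    rng = random.Random(seed)
    genotype, expression, phenotype = [], [], []
    for _ in range(n):
        g = float(rng.choice([0, 1, 2]))   # SNP coded 0/1/2
        u = rng.gauss(0, 1)                # unobserved confounder
        x = g + u + rng.gauss(0, 1)        # gene expression
        y = causal_effect * x + 2.0 * u + rng.gauss(0, 1)  # phenotype
        genotype.append(g)
        expression.append(x)
        phenotype.append(y)
    return genotype, expression, phenotype


genotype, expression, phenotype = simulate()
naive_estimate, _ = ols(expression, phenotype)
iv_estimate = two_stage_least_squares(genotype, expression, phenotype)

print("naive regression estimate:", round(naive_estimate, 3))
print("instrumental variable estimate:", round(iv_estimate, 3))
```

Because the confounder pushes expression and phenotype in the same direction, the naive regression overstates the simulated effect of 0.5, while the genotype-based estimate is not biased by it. With genome-wide genotypes and many genes, both stages would require regularized regression.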
Aims 3 and 4 propose a set of new methods for gene set enrichment analysis, including methods for gene-set analysis by testing homogeneity of the covariance matrices and a class of multivariate random-set methods for integrative analysis of diverse genomic data. These methods hinge on a novel integration of methods for high-dimensional regression and high-dimensional covariance matrix estimation, and on a novel incorporation of prior functional gene sets and pathways. The new methods can be applied to different types of genomic data and should facilitate the identification of genes and their complex interactions, as well as the biological pathways underlying various complex human diseases. The proposed work will contribute statistical methodology for modeling high-dimensional genomic data and for studying complex phenotypes and biological systems, and will offer insights into each of the biological areas represented by the various data sets. All programs developed under this grant, together with detailed documentation, will be made available free of charge to interested researchers.

Public Health Relevance

This project aims to develop powerful statistical and computational methods for integrative analysis of diverse genomic data. The novel statistical methods are expected to yield insights into how genomic perturbations and pathway dysfunction can lead to the development of complex diseases such as neuroblastoma and human heart failure.

National Institutes of Health (NIH)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Li, Jerry
University of Pennsylvania
Biostatistics & Other Math Sci
Schools of Medicine
United States
Hu, Xiaowen; Feng, Yi; Zhang, Dongmei et al. (2014) A functional genomic approach identifies FAL1 as an oncogenic long noncoding RNA that associates with BMI1 and represses p21 expression in cancer. Cancer Cell 26:344-57
Vardhanabhuti, Saran; Jeng, X Jessie; Wu, Yinghua et al. (2014) Parametric modeling of whole-genome sequencing data for CNV identification. Biostatistics 15:427-41
Vardhanabhuti, Saran; Li, Mingyao; Li, Hongzhe (2013) A Hierarchical Bayesian Model for Estimating and Inferring Differential Isoform Expression for Multi-Sample RNA-Seq Data. Stat Biosci 5:119-137
Yin, Jianxin; Li, Hongzhe (2013) Adjusting for High-dimensional Covariates in Sparse Precision Matrix Estimation by ℓ1-Penalization. J Multivar Anal 116:365-381
Chen, Jun; Bushman, Frederic D; Lewis, James D et al. (2013) Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis. Biostatistics 14:244-58
Zhi, Wei; Minturn, Jane; Rappaport, Eric et al. (2013) Network-based analysis of multivariate gene expression data. Methods Mol Biol 972:121-39
Li, Hongzhe (2013) Systems biology approaches to epidemiological studies of complex diseases. Wiley Interdiscip Rev Syst Biol Med 5:677-86
Jeng, X Jessie; Cai, T Tony; Li, Hongzhe (2013) Simultaneous Discovery of Rare and Common Segment Variants. Biometrika 100:157-172
Daye, Z John; Chen, Jinbo; Li, Hongzhe (2012) High-Dimensional Heteroscedastic Regression with an Application to eQTL Data Analysis. Biometrics 68:316-326
