The broad, long-term objective of this project is to develop novel statistical methods and computational tools for statistical and probabilistic modeling of multiple types of large-scale genomic data, motivated by important biological questions and experiments. New high-throughput technologies and next-generation sequencing are generating various types of very high-dimensional genomic and proteomic data and metadata (e.g., network and pathway databases), with the goal of obtaining a systems-level understanding of various complex phenotypes. As the amount and complexity of the data increase and as the questions being addressed become more sophisticated, valid statistical and biological inference requires analysis methods that can integrate these genomic data while incorporating information about gene function and pathways into the analysis of numerical vector/matrix data.
The specific aims of the current project are to develop new statistical models and methods for integrative analysis of genomic data in the context of pathways and networks. Motivated by the analysis of genetic genomics data and diverse cancer genomic data, the first aim is to develop novel statistical methods for estimating a genotype-adjusted precision matrix for a set of genes at the transcriptional level. The resulting regression coefficient matrix and sparse precision matrix provide important information on gene regulation after the cis- and trans-genetic effects on gene expression are adjusted for.
The second aim is to develop high-dimensional instrumental variable regression for eQTL data analysis in order to identify potential causal genes for a phenotype, with the genome-wide genotypes serving as instrumental variables.
Aims 3 and 4 propose a set of new methods for gene set enrichment analysis, including methods for gene-set analysis by testing homogeneity of the covariance matrices and a class of multivariate random-set methods for integrative analysis of diverse genomic data. These methods hinge on novel integration of methods for high-dimensional regression and high-dimensional covariance matrix estimation and on novel incorporation of prior functional gene sets and pathways. The new methods can be applied to different types of genomic data and should facilitate the identification of genes and their complex interactions, as well as the biological pathways underlying various complex human diseases. The work proposed here will contribute statistical methodology for modeling high-dimensional genomic data and for studying complex phenotypes and biological systems, and will offer insights into each of the biological areas represented by the various data sets. All programs developed under this grant, together with detailed documentation, will be made available free of charge to interested researchers.

Public Health Relevance

This project aims to develop powerful statistical and computational methods for integrative analysis of diverse genomic data. The novel statistical methods are expected to yield insights into how genomic perturbation and pathway dysfunction can lead to the development of complex diseases such as neuroblastoma and human heart failure.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Li, Jerry
University of Pennsylvania
Biostatistics & Other Math Sci
Schools of Medicine
United States
Chen, Eric Z; Bushman, Frederic D; Li, Hongzhe (2017) A Model-Based Approach For Species Abundance Quantification Based On Shotgun Metagenomic Data. Stat Biosci 9:13-27
Shi, Pixu; Li, Hongzhe (2017) A model for paired-multinomial data and its application to analysis of data on a taxonomic tree. Biometrics 73:1266-1278
Liao, Katherine P; Sparks, Jeffrey A; Hejblum, Boris P et al. (2017) Phenome-Wide Association Study of Autoantibodies to Citrullinated and Noncitrullinated Epitopes in Rheumatoid Arthritis. Arthritis Rheumatol 69:742-749
Zhao, Sihai Dave; Cai, T Tony; Li, Hongzhe (2017) Optimal detection of weak positive latent dependence between two sequences of multiple tests. J Multivar Anal 160:169-184
Cai, T Tony; Li, Hongzhe; Liu, Weidong et al. (2016) Joint Estimation of Multiple High-dimensional Precision Matrices. Stat Sin 26:445-464
Cai, T Tony; Zhang, Anru (2016) Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data. J Multivar Anal 150:55-74
Cai, Tianxi; Cai, T Tony; Zhang, Anru (2016) Structured Matrix Completion with Applications to Genomic Data Integration. J Am Stat Assoc 111:621-633
Cai, T Tony; Zhang, Anru (2016) Inference for High-dimensional Differential Correlation Matrices. J Multivar Anal 143:107-126
Cai, T Tony; Liu, Weidong (2016) Large-Scale Multiple Testing of Correlations. J Am Stat Assoc 111:229-240
Chen, Eric Z; Li, Hongzhe (2016) A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. Bioinformatics 32:2611-2617

Showing the most recent 10 out of 58 publications