The broad, long-term objective of this project is to develop novel statistical methods and computational tools for statistical and probabilistic modeling of large-scale data spanning multiple types of genomic measurements, motivated by important biological questions and experiments. New high-throughput technologies and next-generation sequencing are generating many types of very high-dimensional genomic and proteomic data, along with metadata (e.g., network and pathway databases), with the goal of a systems-level understanding of complex phenotypes. As the amount and complexity of the data increase and the questions being addressed become more sophisticated, valid statistical and biological inference requires analysis methods that can integrate these genomic data while incorporating information about gene function and pathways into the analysis of numerical vector and matrix data.
The specific aims of the current project are to develop new statistical models and methods for integrative analysis of genomic data in the context of pathways and networks. Motivated by the analysis of genetic genomics data and diverse cancer genomic data, the first aim is to develop novel statistical methods for estimating a genotype-adjusted precision matrix for a set of genes at the transcriptional level. The resulting regression coefficient matrix and sparse precision matrix provide important information on gene regulation once the cis- and trans-genetic effects on gene expression have been adjusted for.
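As a rough illustration of what a genotype-adjusted precision matrix is, the sketch below regresses expression on genotype and inverts the residual covariance. It is a low-dimensional stand-in only: it uses ordinary least squares and a plain matrix inverse, which require many more samples than genes, whereas the project targets sparse, high-dimensional estimators. The function name `genotype_adjusted_precision`, the simulated data, and all parameter values are assumptions made for this example.

```python
import numpy as np

def genotype_adjusted_precision(Y, X):
    """Hypothetical helper: Y is (n, p) expression, X is (n, q) genotypes.

    Fits Y = X B + E by least squares (with an intercept), then inverts the
    residual covariance. Only sensible when n is much larger than p and q.
    """
    n = Y.shape[0]
    X1 = np.column_stack([np.ones(n), X])           # add intercept column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # (q + 1, p) coefficients
    resid = Y - X1 @ coef                           # expression with genotype effects removed
    sigma_hat = resid.T @ resid / (n - X1.shape[1])
    omega_hat = np.linalg.inv(sigma_hat)            # precision = inverse covariance
    return coef[1:], omega_hat                      # drop the intercept row

# Simulated example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, p, q = 5000, 3, 2
X = rng.integers(0, 3, size=(n, q)).astype(float)   # genotypes coded 0/1/2
B = np.array([[1.0, 0.0, 0.5],
              [0.0, -1.0, 0.0]])
Omega = np.array([[2.0, 0.8, 0.0],
                  [0.8, 2.0, 0.0],
                  [0.0, 0.0, 1.0]])                 # genes 1 and 2 conditionally dependent
E = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
Y = X @ B + E

B_hat, Omega_hat = genotype_adjusted_precision(Y, X)
print(np.round(B_hat, 2))
print(np.round(Omega_hat, 2))
```

Near-zero entries of the estimated precision matrix correspond to gene pairs that are conditionally independent given the other genes and the genotypes. In high dimensions, a sparsity-inducing penalty would replace the plain inverse.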
The second aim is to develop high-dimensional instrumental variable regression for eQTL data analysis in order to identify potential causal genes for a phenotype, with genome-wide genotypes serving as instrumental variables.
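The idea of using genotypes as instruments can be sketched with classical two-stage least squares (2SLS). This is the textbook low-dimensional procedure, not the regularized high-dimensional method the aim proposes. The function name `two_stage_least_squares`, the simulated confounder, and the effect sizes are all assumptions for illustration.

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """Hypothetical helper: classical 2SLS.

    y: (n,) phenotype; X: (n, k) gene expression (exposures);
    Z: (n, m) genotypes used as instruments, with m >= k.
    """
    n = len(y)
    Z1 = np.column_stack([np.ones(n), Z])
    # Stage 1: predict expression from genotypes.
    G, *_ = np.linalg.lstsq(Z1, X, rcond=None)
    X_hat = Z1 @ G
    # Stage 2: regress phenotype on the genotype-predicted expression.
    X1 = np.column_stack([np.ones(n), X_hat])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[1:]                                  # drop the intercept

# Simulated example with an unobserved confounder U (illustrative values).
rng = np.random.default_rng(1)
n = 20000
Z = rng.integers(0, 3, size=(n, 5)).astype(float)   # genotypes coded 0/1/2
U = rng.normal(size=n)                              # affects both expression and phenotype
gamma = np.array([0.5, -0.4, 0.3, 0.6, -0.5])
X = (Z @ gamma + U + rng.normal(size=n)).reshape(-1, 1)
beta = 1.5                                          # true causal effect of expression
y = beta * X[:, 0] + 2.0 * U + rng.normal(size=n)

beta_iv = two_stage_least_squares(y, X, Z)[0]
beta_ols = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)[0][1]
print(f"naive OLS estimate: {beta_ols:.2f}")
print(f"2SLS estimate:      {beta_iv:.2f}")
```

Because the confounder influences both expression and phenotype, the naive regression overstates the effect, while the genotype-based estimate stays close to the true value. This relies on the genotypes affecting the phenotype only through expression, which is the standard instrumental variable assumption.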
Aims 3 and 4 propose a set of new methods for gene set enrichment analysis, including methods for gene-set analysis by testing homogeneity of covariance matrices and a class of multivariate random-set methods for integrative analysis of diverse genomic data. These methods hinge on a novel integration of methods for high-dimensional regression and high-dimensional covariance matrix estimation, and on a novel incorporation of prior functional gene sets and pathways. The new methods can be applied to different types of genomic data and should help identify genes, their complex interactions, and the biological pathways underlying various complex human diseases. The proposed work will contribute statistical methodology for modeling high-dimensional genomic data and for studying complex phenotypes and biological systems, and will offer insights into each of the biological areas represented by the various data sets. All programs developed under this grant, along with detailed documentation, will be made available free of charge to interested researchers.
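To make the covariance-homogeneity idea from Aim 3 concrete, the following sketch compares the covariance matrices of one gene set between two groups using a generic permutation test on the Frobenius norm of their difference. This is a simple stand-in chosen for illustration, not the test proposed in the project. The function name `cov_homogeneity_perm_test`, the group sizes, and the simulated correlation structure are assumptions.

```python
import numpy as np

def cov_homogeneity_perm_test(A, B, n_perm=500, seed=0):
    """Hypothetical helper: permutation test of H0: Cov(A) == Cov(B).

    A, B: (n_a, p) and (n_b, p) expression matrices for one gene set.
    Returns the observed Frobenius-norm statistic and a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    # Center each group so that mean differences do not drive the statistic.
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)

    def stat(a, b):
        diff = np.cov(a, rowvar=False) - np.cov(b, rowvar=False)
        return np.linalg.norm(diff, "fro")

    observed = stat(A, B)
    pooled = np.vstack([A, B])
    n_a = A.shape[0]
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])       # shuffle group labels
        if stat(pooled[idx[:n_a]], pooled[idx[n_a:]]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)

# Simulated gene set of 5 genes: independent in controls, co-expressed in cases.
rng = np.random.default_rng(2)
p = 5
sigma_case = np.full((p, p), 0.8)
np.fill_diagonal(sigma_case, 1.0)
controls = rng.multivariate_normal(np.zeros(p), np.eye(p), size=200)
cases = rng.multivariate_normal(np.zeros(p), sigma_case, size=200)

observed, p_value = cov_homogeneity_perm_test(controls, cases)
print(f"statistic = {observed:.2f}, permutation p-value = {p_value:.4f}")
```

A small p-value suggests that the co-expression structure of the gene set differs between groups even when the mean expression levels do not. With many genes and few samples, the sample covariance becomes unreliable, which is the setting the proposed high-dimensional methods are meant to address.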

Public Health Relevance

This project aims to develop powerful statistical and computational methods for integrative analysis of diverse genomic data. The novel statistical methods are expected to yield further insight into how genomic perturbations and pathway dysfunction can lead to the development of complex diseases such as neuroblastoma and human heart failure.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-HDM-T (90))
Program Officer: Li, Jerry
University of Pennsylvania
Biostatistics & Other Math Sci
Schools of Medicine
United States
Cai, T Tony; Zhang, Anru (2016) Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data. J Multivar Anal 150:55-74
Cai, T Tony; Liu, Weidong (2016) Large-Scale Multiple Testing of Correlations. J Am Stat Assoc 111:229-240
Cai, T Tony; Zhang, Anru (2016) Inference for High-dimensional Differential Correlation Matrices. J Multivar Anal 143:107-126
Lin, Wei; Feng, Rui; Li, Hongzhe (2015) Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics. J Am Stat Assoc 110:270-288
Jeng, Jessie; Wu, Qian; Li, Hongzhe (2015) A Statistical Method for Identifying Trait-Associated Copy Number Variants. Hum Hered 79:147-56
Wu, Qian; Won, Kyoung-Jae; Li, Hongzhe (2015) Nonparametric Tests for Differential Histone Enrichment with ChIP-Seq Data. Cancer Inform 14:11-22
Zsiros, Emese; Duttagupta, Priyanka; Dangaj, Denarda et al. (2015) The Ovarian Cancer Chemokine Landscape Is Conducive to Homing of Vaccine-Primed and CD3/CD28-Costimulated T Cells Prepared for Adoptive Therapy. Clin Cancer Res 21:2840-50
Cai, Tony; Ma, Zongming; Wu, Yihong (2015) Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices. Probab Theory Relat Fields 161:781-815
Kelly, Brendan J; Gross, Robert; Bittinger, Kyle et al. (2015) Power and sample-size estimation for microbiome studies using pairwise distances and PERMANOVA. Bioinformatics 31:2461-8
Li, Yun R; Li, Jin; Zhao, Sihai D et al. (2015) Meta-analysis of shared genetic architecture across ten pediatric autoimmune diseases. Nat Med 21:1018-27

Showing the most recent 10 out of 50 publications