ChIP-seq is a powerful technology to map genome-wide protein-DNA interactions (PDIs). It is increasingly used by scientists worldwide to study how gene activities are controlled in normal cells and why they are disrupted in diseases. Applying ChIP-seq to study gene regulation faces three major challenges: (1) how to analyze large ChIP-seq data sets to discover dynamic changes of gene regulation across different biological contexts, (2) how to infer global regulatory programs under the practical constraint that it is not feasible to conduct ChIP-seq for all transcription factors (TFs), and (3) how to analyze allele-specific events given the small amount of data at heterozygous SNPs, which results in low statistical power. This study investigates novel statistical and computational solutions to address these challenges.

First, a new method will be developed to discover and characterize dynamic changes of gene regulation across different biological contexts. This method, Generalized Differential Principal Component Analysis (dPCA/GDPCA), integrates unsupervised pattern discovery, dimension reduction, and statistical inference into a single statistical framework. It provides a systematic solution to analyze quantitative and curve shape changes in large ChIP-seq data sets involving multiple proteins. It is expected to have a wide range of applications.

Second, a computational framework will be developed to predict global gene regulation dynamics, i.e., dynamic changes of downstream regulatory events of all TFs for which DNA binding motif information is available. The analysis integrates the dynamic changes of histone modification ChIP-seq, DNase-seq, and FAIRE-seq data with DNA sequences, public ChIP-seq, and public gene expression data. It will provide a practical, affordable, and reasonably accurate solution for using ChIP-seq to study many TFs simultaneously. A systematic benchmark study will also be conducted to evaluate the impact of technologies, data types, and analytical methods on prediction performance. This benchmark study will provide guidelines for designing informative future experiments.

Third, a method for detecting allele-specific protein-DNA binding (ASB) will be developed. The method is able to integrate information from multiple ChIP-seq data sets and completely phased genome sequences to significantly improve the statistical power of ASB inference. Various sources of bias will also be handled.

Guidelines and new analytical tools generated by this study will allow one to design informative ChIP-seq experiments in the future such that, by collecting one set of ChIP-seq data, one can not only identify locations of PDIs, but also infer global dynamic changes of TF binding sites across different biological contexts and, if genotype data are available, robustly analyze allele-specific gene regulation. This will make ChIP-seq a low-cost, high-reward experiment that serves multiple purposes. By significantly expanding the utility and increasing the power of ChIP-seq, our computational infrastructure is expected to have a major impact on advancing future studies of gene regulation and dissections of regulatory mechanisms behind human diseases.
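The power problem behind the third challenge can be illustrated with a small numerical sketch. This is not the method proposed in this project: it is a naive exact binomial test against a 50:50 allelic ratio, the read counts are hypothetical, and simple pooling ignores the biases (for example, reference mapping bias) that the proposal says will be handled. It only shows why combining several ChIP-seq data sets at one heterozygous SNP can reveal an imbalance that no single data set detects.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= cutoff))


# Hypothetical (reference, alternate) read counts at ONE heterozygous SNP,
# observed in three separate ChIP-seq data sets (illustrative values only).
counts = [(7, 3), (8, 2), (6, 4)]

# Test each data set on its own: few reads, so little power.
per_dataset = [binom_two_sided_p(ref, ref + alt) for ref, alt in counts]

# Naive pooling (an assumption for illustration): sum reads across data sets.
ref_total = sum(ref for ref, _ in counts)
n_total = sum(ref + alt for ref, alt in counts)
pooled = binom_two_sided_p(ref_total, n_total)

for (ref, alt), p_val in zip(counts, per_dataset):
    print(f"ref={ref} alt={alt} p={p_val:.4f}")
print(f"pooled ref={ref_total} of {n_total} reads, p={pooled:.4f}")
```

With these made-up counts, none of the individual tests falls below 0.05, while the pooled test does. A realistic analysis would also need to account for overdispersion and between-data-set heterogeneity, which this sketch does not attempt.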

Public Health Relevance

ChIP-seq is a powerful technology to analyze how gene activities are controlled in normal cells and in diseases. This proposal develops statistical and computational tools urgently needed by scientists to analyze large and complex ChIP-seq data sets. By allowing one to examine dynamic changes of global gene regulatory programs across different biological contexts, the new computational technologies developed in this proposal are expected to have a major impact on advancing future studies of regulatory mechanisms behind human diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Pazin, Michael J
Johns Hopkins University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Zhou, Weiqiang; Sherwood, Ben; Ji, Zhicheng et al. (2017) Genome-wide prediction of DNase I hypersensitivity using gene expression. Nat Commun 8:1038
Kuang, Zheng; Pinglay, Sudarshan; Ji, Hongkai et al. (2017) Msn2/4 regulate expression of glycolytic enzymes and control transition from quiescence to growth. Elife 6:
Ji, Zhicheng; Zhou, Weiqiang; Ji, Hongkai (2017) Single-cell regulome data analysis by SCRAT. Bioinformatics 33:2930-2932
Zhou, Weiqiang; Sherwood, Ben; Ji, Hongkai (2016) Computational Prediction of the Global Functional Genomic Landscape: Applications, Methods, and Challenges. Hum Hered 81:88-105
Zhao, Tianqi; Cheng, Guang; Liu, Han (2016) A PARTIALLY LINEAR FRAMEWORK FOR MASSIVE HETEROGENEOUS DATA. Ann Stat 44:1400-1437
Kang, Jian; Bowman, F DuBois; Mayberg, Helen et al. (2016) A depression network of functionally connected regions discovered via multi-attribute canonical correlation graphs. Neuroimage 141:431-441
Zhang, Shilu; Du, Fang; Ji, Hongkai (2015) A novel DNA sequence motif in human and mouse genomes. Sci Rep 5:10444
Wamaitha, Sissy E; del Valle, Ignacio; Cho, Lily T Y et al. (2015) Gata6 potently initiates reprograming of pluripotent and differentiated cells to extraembryonic endoderm stem cells. Genes Dev 29:1239-55
Rosenblum, Michael; Liu, Han; Yen, En-Hsu (2014) Optimal Tests of Treatment Effects for the Overall Population and Two Subpopulations in Randomized Trials, using Sparse Linear Programming. J Am Stat Assoc 109:1216-1228
Zhao, Tuo; Liu, Han (2014) Calibrated Precision Matrix Estimation for High-Dimensional Elliptical Distributions. IEEE Trans Inf Theory 60:7874-7887
