ChIP-seq is a powerful technology for mapping genome-wide protein-DNA interactions (PDIs). It is increasingly used by scientists worldwide to study how gene activities are controlled in normal cells and why they are disrupted in diseases. Applying ChIP-seq to study gene regulation faces three major challenges: (1) how to analyze large ChIP-seq data sets to discover dynamic changes of gene regulation across different biological contexts, (2) how to infer global regulatory programs under the practical constraint that it is not feasible to conduct ChIP-seq for all transcription factors (TFs), and (3) how to analyze allele-specific events given the small amount of data at heterozygous SNPs, which results in low statistical power. This study investigates novel statistical and computational solutions to these challenges. First, a new method will be developed to discover and characterize dynamic changes of gene regulation across different biological contexts. This method, Generalized Differential Principal Component Analysis (dPCA/GDPCA), integrates unsupervised pattern discovery, dimension reduction, and statistical inference into a single statistical framework. It provides a systematic solution for analyzing quantitative and curve-shape changes in large ChIP-seq data sets involving multiple proteins. It is expected to have a wide range of applications. Second, a computational framework will be developed to predict global gene regulation dynamics, i.e., dynamic changes of downstream regulatory events of all TFs for which DNA binding motif information is available. The analysis integrates the dynamic changes of histone modification ChIP-seq, DNase-seq, and FAIRE-seq data with DNA sequences, public ChIP-seq data, and public gene expression data. It will provide a practical, affordable, and reasonably accurate way to use ChIP-seq to study many TFs simultaneously.
A systematic benchmark study will also be conducted to evaluate the impact of technologies, data types, and analytical methods on prediction performance. This benchmark study will provide guidelines for designing informative future experiments. Third, a method for detecting allele-specific protein-DNA binding (ASB) will be developed. The method will integrate information from multiple ChIP-seq data sets and completely phased genome sequences to significantly improve the statistical power of ASB inference. Various sources of bias will also be addressed. Guidelines and new analytical tools generated by this study will allow one to design informative ChIP-seq experiments in the future such that, by collecting one set of ChIP-seq data, one can not only identify locations of PDIs but also infer global dynamic changes of TF binding sites across different biological contexts and, if genotype data are available, robustly analyze allele-specific gene regulation. This will make ChIP-seq a low-cost, high-reward experiment that serves multiple purposes. By significantly expanding the utility and increasing the power of ChIP-seq, our computational infrastructure is expected to have a major impact on advancing future studies of gene regulation and dissections of regulatory mechanisms behind human diseases.
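The power problem behind the third aim can be illustrated with a small sketch. It is not the ASB method proposed in this study: it is a plain exact binomial test, and it assumes that reads at a heterozygous SNP are equally likely to carry either allele under the null, that the allelic imbalance points in the same direction in every data set, and that there is no reference mapping bias. The read counts and the helper name `binom_two_sided_p` are hypothetical.

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))

# Hypothetical (reference, alternate) read counts at one heterozygous SNP
# in three ChIP-seq data sets; these numbers are made up for illustration.
counts = [(7, 3), (8, 2), (9, 4)]

# Test each data set on its own: few reads per SNP, so little power.
per_dataset = [binom_two_sided_p(ref, ref + alt) for ref, alt in counts]

# Naive pooling: add the reads across data sets and test once.
pooled_ref = sum(ref for ref, _ in counts)
pooled_n = sum(ref + alt for ref, alt in counts)
pooled = binom_two_sided_p(pooled_ref, pooled_n)

print("per data set:", per_dataset)
print("pooled:", pooled)
```

With these made-up counts, no single data set reaches the 0.05 level, while the pooled test does. Simple summation is only valid under the assumptions listed above, which is one reason a method that models multiple data sets, phased genotypes, and bias sources is needed.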

Public Health Relevance

ChIP-seq is a powerful technology for analyzing how gene activities are controlled in normal cells and in diseases. This proposal develops statistical and computational tools urgently needed by scientists to analyze large and complex ChIP-seq data sets. By allowing one to examine dynamic changes of global gene regulatory programs across different biological contexts, the new computational technologies developed in this proposal are expected to have a major impact on advancing future studies of regulatory mechanisms behind human diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Pazin, Michael J
Johns Hopkins University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Kang, Jian; Bowman, F DuBois; Mayberg, Helen et al. (2016) A depression network of functionally connected regions discovered via multi-attribute canonical correlation graphs. Neuroimage 141:431-41
Zhang, Shilu; Du, Fang; Ji, Hongkai (2015) A novel DNA sequence motif in human and mouse genomes. Sci Rep 5:10444
Han, Fang; Liu, Han (2014) High Dimensional Semiparametric Scale-Invariant Principal Component Analysis. IEEE Trans Pattern Anal Mach Intell 36:2016-32
Rosenblum, Michael; Liu, Han; Yen, En-Hsu (2014) Optimal Tests of Treatment Effects for the Overall Population and Two Subpopulations in Randomized Trials, using Sparse Linear Programming. J Am Stat Assoc 109:1216-1228
Han, Fang; Liu, Han (2014) Scale-Invariant Sparse PCA on High Dimensional Meta-elliptical Data. J Am Stat Assoc 109:275-287
Ying, Mingyao; Tilghman, Jessica; Wei, Yingying et al. (2014) Kruppel-like factor-9 (KLF9) inhibits glioblastoma stemness through global transcription repression and integrin α6 inhibition. J Biol Chem 289:32742-56
Zhao, Tuo; Liu, Han (2014) Calibrated Precision Matrix Estimation for High-Dimensional Elliptical Distributions. IEEE Trans Inf Theory 60:7874-7887
Kuang, Zheng; Cai, Ling; Zhang, Xuekui et al. (2014) High-temporal-resolution view of transcription and chromatin states across distinct metabolic states in budding yeast. Nat Struct Mol Biol 21:854-63
Wang, Jiayi; Park, Joo-Seop; Wei, Yingying et al. (2013) TRIB2 acts downstream of Wnt/TCF in liver cancer cells to regulate YAP and C/EBPα function. Mol Cell 51:211-25
Ji, Hongkai; Li, Xia; Wang, Qian-fei et al. (2013) Differential principal component analysis of ChIP-seq. Proc Natl Acad Sci U S A 110:6789-94

Showing the most recent 10 out of 11 publications