ChIP-seq is a powerful technology for mapping genome-wide protein-DNA interactions (PDIs). It is increasingly used by scientists worldwide to study how gene activities are controlled in normal cells and why they are disrupted in diseases. Applying ChIP-seq to study gene regulation faces three major challenges: (1) how to analyze large ChIP-seq data sets to discover dynamic changes of gene regulation across different biological contexts, (2) how to infer global regulatory programs under the practical constraint that it is not feasible to conduct ChIP-seq for all transcription factors (TFs), and (3) how to analyze allele-specific events when the small amount of data at heterozygous SNPs results in low statistical power. This study investigates novel statistical and computational solutions to address these challenges. First, a new method will be developed to discover and characterize dynamic changes of gene regulation across different biological contexts. This method, Generalized Differential Principal Component Analysis (dPCA/GDPCA), integrates unsupervised pattern discovery, dimension reduction, and statistical inference into a single statistical framework. It provides a systematic solution for analyzing quantitative and curve-shape changes in large ChIP-seq data sets involving multiple proteins. It is expected to have a wide range of applications. Second, a computational framework will be developed to predict global gene regulation dynamics, i.e., dynamic changes of downstream regulatory events of all TFs for which DNA binding motif information is available. The analysis integrates the dynamic changes of histone modification ChIP-seq, DNase-seq, and FAIRE-seq data with DNA sequences, public ChIP-seq, and public gene expression data. It will provide a practical, affordable, and reasonably accurate solution for using ChIP-seq to study many TFs simultaneously.
A systematic benchmark study will also be conducted to evaluate the impact of technologies, data types, and analytical methods on prediction performance. This benchmark study will provide guidelines for designing informative future experiments. Third, a method for detecting allele-specific protein-DNA binding (ASB) will be developed. The method integrates information from multiple ChIP-seq data sets and completely phased genome sequences to significantly improve the statistical power of ASB inference. Various sources of bias will also be handled. Guidelines and new analytical tools generated by this study will allow one to design informative ChIP-seq experiments in the future, such that by collecting one set of ChIP-seq data, one can not only identify locations of PDIs but also infer global dynamic changes of TF binding sites across different biological contexts and, if genotype data are available, robustly analyze allele-specific gene regulation. This will make ChIP-seq a low-cost, high-reward experiment that serves multiple purposes. By significantly expanding the utility and increasing the power of ChIP-seq, our computational infrastructure is expected to have a major impact on advancing future studies of gene regulation and dissections of regulatory mechanisms behind human diseases.
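The power problem behind the third aim can be illustrated with a small sketch. It is not the method proposed in this project: it assumes a plain exact binomial test against a 50:50 allelic ratio and naive summing of read counts across data sets, and it ignores mapping bias, overdispersion, and phasing. The function names and the read counts are hypothetical.

```python
from math import comb


def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under Binomial(n, p).
    """
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def pooled_asb_p(counts: list[tuple[int, int]]) -> float:
    """Naively pool (ref_reads, alt_reads) pairs from several data sets
    at one heterozygous SNP and test the pooled counts for imbalance.

    Assumption for illustration only: the allelic imbalance points in the
    same direction in every data set, so the counts can simply be summed.
    """
    ref = sum(r for r, _ in counts)
    alt = sum(a for _, a in counts)
    return binom_two_sided_p(ref, ref + alt)


# Hypothetical read counts at one heterozygous SNP in three ChIP-seq data sets.
example_counts = [(7, 2), (6, 1), (8, 3)]
single_p = binom_two_sided_p(7, 9)      # first data set alone
pooled_p = pooled_asb_p(example_counts)  # all three data sets combined
```

With these made-up counts, the first data set alone does not reach the conventional 0.05 threshold, while the pooled counts do, which is the intuition for combining information across data sets. A realistic analysis would need to model the biases that this sketch leaves out.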

Public Health Relevance

ChIP-seq is a powerful technology for analyzing how gene activities are controlled in normal cells and in disease. This proposal develops statistical and computational tools urgently needed by scientists to analyze large and complex ChIP-seq data sets. By allowing one to examine dynamic changes of global gene regulatory programs across different biological contexts, the new computational technologies developed in this proposal are expected to have a major impact on advancing future studies of regulatory mechanisms behind human diseases.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006841-02
Application #: 8543753
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Pazin, Michael J
Project Start: 2012-09-12
Project End: 2015-05-31
Budget Start: 2013-06-01
Budget End: 2014-05-31
Support Year: 2
Fiscal Year: 2013
Total Cost: $309,420
Indirect Cost: $100,876

Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21218

Kuang, Zheng; Ji, Zhicheng; Boeke, Jef D et al. (2018) Dynamic motif occupancy (DynaMO) analysis identifies transcription factors and their binding sites driving dynamic biological processes. Nucleic Acids Res 46:e2
Fan, Jianqing; Liu, Han; Sun, Qiang et al. (2018) I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. Ann Stat 46:814-841
Zhou, Weiqiang; Sherwood, Ben; Ji, Zhicheng et al. (2017) Genome-wide prediction of DNase I hypersensitivity using gene expression. Nat Commun 8:1038
Kuang, Zheng; Pinglay, Sudarshan; Ji, Hongkai et al. (2017) Msn2/4 regulate expression of glycolytic enzymes and control transition from quiescence to growth. Elife 6:
Ji, Zhicheng; Zhou, Weiqiang; Ji, Hongkai (2017) Single-cell regulome data analysis by SCRAT. Bioinformatics 33:2930-2932
Kang, Jian; Bowman, F DuBois; Mayberg, Helen et al. (2016) A depression network of functionally connected regions discovered via multi-attribute canonical correlation graphs. Neuroimage 141:431-441
Zhao, Tuo; Liu, Han (2016) Accelerated Path-following Iterative Shrinkage Thresholding Algorithm with Application to Semiparametric Graph Estimation. J Comput Graph Stat 25:1272-1296
Zhou, Weiqiang; Sherwood, Ben; Ji, Hongkai (2016) Computational Prediction of the Global Functional Genomic Landscape: Applications, Methods, and Challenges. Hum Hered 81:88-105
Zhao, Tianqi; Cheng, Guang; Liu, Han (2016) A partially linear framework for massive heterogeneous data. Ann Stat 44:1400-1437
Wamaitha, Sissy E; del Valle, Ignacio; Cho, Lily T Y et al. (2015) Gata6 potently initiates reprograming of pluripotent and differentiated cells to extraembryonic endoderm stem cells. Genes Dev 29:1239-55

Showing the most recent 10 out of 22 publications