The advent of high-throughput next-generation sequencing (NGS) technologies has revolutionized the fields of genetics and genomics by allowing rapid and inexpensive sequencing of billions of bases. Among the NGS applications, ChIP-seq (chromatin immunoprecipitation followed by NGS) is perhaps the most successful to date. ChIP-seq technology enables investigators to study the genome-wide binding of transcription factors and to map epigenomic marks. Both play crucial roles in programming cell-specific gene expression; therefore, their genome-wide mapping can significantly advance our ability to understand and diagnose human diseases. Although the number of basic analysis tools for ChIP-seq data is growing rapidly, there has been little progress on the design problems of ChIP-seq experiments. A challenging question that researchers planning a ChIP-seq experiment need to answer is: how deeply should the ChIP and control samples be sequenced? The answer depends on multiple factors, some of which can be set by the experimenter based on pilot or preliminary data. The sequencing depth of a ChIP-seq experiment is one of the key factors that determine whether all the underlying targets (e.g., binding locations or epigenomic profiles) can be identified with a targeted power. This is especially important when the goal is the analysis of individual-to-individual and allele-specific variation in transcription factor binding and epigenomic profiles. Insufficient sequencing depth may lead to spurious differences in binding or epigenomic profiles. In this proposal, we aim to develop a general framework for power calculations in ChIP-seq experiments, built on statistical models commonly used in ChIP-seq analysis, with three specific aims: (1) power calculations based on the conditional Binomial model; (2) power calculations based on the Poisson and Negative Binomial regression models; (3) a power calculation tool for GALAXY and Bioconductor.
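To make the first aim concrete, the following is a minimal illustrative sketch of a power calculation under a conditional Binomial model. It is not the framework proposed here. It assumes a common textbook formulation in which, for a single genomic bin with n total reads (ChIP plus control), the ChIP count is Binomial(n, p), with p determined by the two sequencing depths under the null and inflated by a fold enrichment under the alternative. The function names and the parameters `depth_chip`, `depth_control`, `fold`, and `alpha` are hypothetical and chosen only for illustration.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_value(n, p0, alpha):
    """Smallest k with P(X >= k | p0) <= alpha (one-sided exact test)."""
    for k in range(n + 2):
        # k = n + 1 gives an empty sum (probability 0), so the loop always returns.
        if binom_upper_tail(k, n, p0) <= alpha:
            return k


def conditional_binomial_power(n, depth_chip, depth_control, fold, alpha=0.05):
    """Power to detect a `fold` enrichment in a bin holding n total reads.

    Assumed model (illustrative only): given n, the ChIP count is
    Binomial(n, p), where p depends on the sequencing depths.
    """
    # Null: reads split in proportion to sequencing depth.
    p0 = depth_chip / (depth_chip + depth_control)
    # Alternative: the ChIP signal is `fold` times the background rate.
    p1 = fold * depth_chip / (fold * depth_chip + depth_control)
    k = critical_value(n, p0, alpha)
    return binom_upper_tail(k, n, p1)


if __name__ == "__main__":
    # Compare power in a sparsely covered bin and a well covered bin.
    for n in (10, 100):
        print(n, conditional_binomial_power(n, 1.0, 1.0, fold=3.0))
```

Under these assumptions, deeper sequencing raises the number of reads n per bin, which is how sequencing depth enters the power in this simplified setting.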
This project will be accomplished through a combination of theoretical and methodological development, simulation, computational analysis, and experimental validation. Methods will be developed and evaluated using datasets from the ENCODE, modENCODE, and RoadMap Epigenomics consortia, as well as novel datasets from collaborators. The statistical resources generated by the project will be disseminated as publicly available software and will provide essential tools for the efficient design of ChIP-seq experiments.
The proposed research is relevant to public health because capturing genome-wide binding of transcription factors and epigenomic information by ChIP-seq technology is invaluable for comprehensively understanding development, differentiation, and disease. The design of ChIP-seq experiments presents unprecedented challenges. We will develop a statistical framework for power calculations in designing ChIP-seq experiments and disseminate results and software to the research community.
|Sun, Guannan; Srinivasan, Rajini; Lopez-Anido, Camila et al. (2014) In silico pooling of ChIP-seq control experiments. PLoS One 9:e109691|
|Zuo, Chandler; Keles, Sunduz (2014) A statistical framework for power calculations in ChIP-seq experiments. Bioinformatics 30:753-60|
|Chung, Dongjun; Park, Dan; Myers, Kevin et al. (2013) dPeak: high resolution identification of transcription factor binding sites from PET and SET ChIP-Seq data. PLoS Comput Biol 9:e1003246|
|Myers, Kevin S; Yan, Huihuang; Ong, Irene M et al. (2013) Genome-scale analysis of Escherichia coli FNR reveals complex features of transcription factor binding. PLoS Genet 9:e1003565|