Characterization of constitutive CTCF/cohesin loci: a possible role in establishing topological domains in mammalian genomes

Recent studies suggested that human/mammalian genomes are divided into large, discrete domains that are units of chromosome organization. CTCF (CCCTC-binding factor) has diverse roles in genome regulation, including transcriptional regulation, chromosome-boundary insulation, DNA replication, and chromatin packaging. It remains unclear whether a subset of CTCF binding sites plays a functional role in establishing/maintaining chromatin topological domains. We systematically analyzed the genomic, transcriptomic and epigenetic profiles of the CTCF binding sites in 56 human cell lines from ENCODE. We identified 24,000 CTCF sites (referred to as constitutive sites) that were bound in more than 90% of the cell lines. Our analysis revealed: 1) constitutive CTCF loci were located in constitutive open chromatin and often co-localized with constitutive cohesin loci; 2) most constitutive CTCF loci were distant from transcription start sites and lacked CpG islands but were enriched with the full-spectrum CTCF motifs: a recently reported 33/34-mer and two other potentially novel motifs (22/26-mer); 3) more importantly, most constitutive CTCF loci were present in CTCF-mediated chromatin interactions detected by ChIA-PET, and these pair-wise interactions occurred predominantly within, but not between, topological domains identified by Hi-C. Our results suggest that the constitutive CTCF sites may play a role in organizing/maintaining the recently identified topological domains that are common across most human cells.

Developing an annotation and visualization tool for ChIP-seq data

A typical ChIP-seq experiment identifies tens of thousands of loci bound by a protein. Often, data analysis such as locus annotation is carried out by someone other than the biologist who generated the data.
Moreover, there is no easy way to graphically visualize all the loci simultaneously. Although one can submit one locus at a time to the UCSC Genome Browser, sequential visualization is practical for a few loci but not for tens of thousands of them. This limitation hampers biologists in discovering interesting loci for hypothesis generation. We have developed a publicly available tool for annotating and dynamically visualizing all loci from one or more ChIP-seq experiments. It is designed with non-bioinformaticians in mind and presents a straightforward user interface. Our server annotates each locus with respect to the known gene information available at NCBI and on the UCSC Genome Browser. It outputs the annotation result in any of several formats, including Excel spreadsheets, tab-separated text files, and HTML documents. Standard information is provided, such as the distance from a ChIP-seq locus to the nearest transcription start site and the symbol and description of the associated gene. More importantly, in the HTML output, each locus is displayed in a graphic window in the context of the respective genome. This allows the biologist to tell instantly whether a locus is intronic, exonic, or in the upstream promoter region. One can also zoom in or out to view a smaller or larger region for visualization, exploration and discovery. The HTML output also displays a pie chart showing the distribution of the loci among UTRs, introns, exons, and promoters. A user can also search for a gene or locus of interest and make other kinds of queries. We expect that this project will significantly advance the state of the art in web-based genomics interfaces. The visualization system is based on modern interface principles and is designed to be intuitive and easy to use, rather than depending on extensive documentation.
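As an illustration of the kind of annotation described above, the following is a minimal sketch of computing the distance from a ChIP-seq locus to the nearest transcription start site (TSS). It is not the tool's actual implementation. The function name `nearest_tss`, the gene symbols, and the coordinates are hypothetical, and strand orientation is ignored for simplicity (all loci and TSSs are assumed to lie on one chromosome).

```python
from bisect import bisect_left

def nearest_tss(locus_pos, tss_positions, gene_symbols):
    """Return (signed_distance, gene_symbol) for the TSS closest to locus_pos.

    tss_positions must be sorted in ascending order and aligned with
    gene_symbols. The distance is locus_pos - tss_position, so a negative
    value means the locus has a smaller coordinate than the TSS
    (strand is ignored in this sketch).
    """
    if not tss_positions:
        raise ValueError("no TSS annotations supplied")
    i = bisect_left(tss_positions, locus_pos)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = range(max(i - 1, 0), min(i + 1, len(tss_positions)))
    best = min(candidates, key=lambda k: abs(tss_positions[k] - locus_pos))
    return locus_pos - tss_positions[best], gene_symbols[best]

# Hypothetical annotation for one chromosome.
positions = [1000, 5000, 9000]
genes = ["GENE_A", "GENE_B", "GENE_C"]
print(nearest_tss(4800, positions, genes))  # (-200, 'GENE_B')
```

Keeping the TSS coordinates sorted lets each lookup run in logarithmic time, which matters when tens of thousands of loci are annotated in one request.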
We also hope that the MiniBrowser Python/JavaScript interface widget developed for this project will be useful for other web-based bioinformatics tools. Our tool allows the biologists who generated the data to explore their data themselves, with the benefit of their own intuition, so as to enhance discovery and hypothesis generation.

T-KDE: A method for genome-wide identification of constitutive protein binding sites from multiple ChIP-seq data sets

A protein may bind to its target DNA sites constitutively, i.e., regardless of cell type. Intuitively, constitutive binding sites should be biologically functional. Knowing the locations of all constitutive sites for a protein of interest is a prerequisite for understanding these sites' functional relevance. Robust and efficient computational methods for identifying constitutive binding sites are lacking, however. We propose a method, T-KDE, to identify the locations of constitutive binding sites. T-KDE, which combines a binary range tree with a kernel density estimator, is applied to ChIP-seq data from multiple cell lines. Using a set of constitutive CTCF (CCCTC-binding factor) sites identified through motif analysis as the gold standard, we compared T-KDE with a binning-based approach and demonstrated that T-KDE performs better. Furthermore, we showed that T-KDE can identify additional constitutive sites that were missed by the motif-based approach in two possible scenarios: 1) a site may be bound in all cell lines but fail to reach the motif significance cutoff; 2) a site may be missed if the peak sequence used in the motif scan is not long enough. Motif analysis of the set of constitutive CTCF sites that failed to reach motif significance discovered two new CTCF motif variants. Using data from ENCODE on 22 transcription factors (TFs) in 112 cell lines, we identified constitutive binding sites for each TF and provide evidence that, for some TFs, they may be biologically meaningful.
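The general idea of locating constitutive sites from pooled peaks can be sketched as follows. This is a simplified illustration and not the published T-KDE algorithm: a simple gap-based grouping of sorted peak summits stands in for the binary range tree, a Gaussian kernel is assumed, and the values of `max_gap`, `bandwidth`, and `min_fraction` are illustrative assumptions (the 0.9 default only echoes the "more than 90% of the cell lines" criterion mentioned earlier in this report). All function and parameter names are hypothetical.

```python
import math

def gaussian_kde_mode(points, bandwidth):
    """Return the integer position (1-bp grid) where a Gaussian KDE of points peaks."""
    lo, hi = int(min(points)), int(max(points))
    best_x, best_density = lo, -1.0
    for x in range(lo, hi + 1):
        density = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        if density > best_density:
            best_x, best_density = x, density
    return best_x

def constitutive_sites(peaks, n_cell_lines, max_gap=100, bandwidth=20.0,
                       min_fraction=0.9):
    """Call constitutive sites from pooled peak summits on one chromosome.

    peaks: list of (summit_position, cell_line_id) tuples.
    Returns a list of (site_position, fraction_of_cell_lines_bound).
    """
    if not peaks:
        return []
    ordered = sorted(peaks)
    # Group summits separated by no more than max_gap (stand-in for the range tree).
    clusters, current = [], [ordered[0]]
    for peak in ordered[1:]:
        if peak[0] - current[-1][0] > max_gap:
            clusters.append(current)
            current = [peak]
        else:
            current.append(peak)
    clusters.append(current)

    sites = []
    for cluster in clusters:
        fraction = len({cell for _, cell in cluster}) / n_cell_lines
        if fraction > min_fraction:  # bound in more than min_fraction of cell lines
            mode = gaussian_kde_mode([pos for pos, _ in cluster], bandwidth)
            sites.append((mode, fraction))
    return sites
```

In this sketch the density mode gives a single representative coordinate for each group of nearby summits, which avoids the boundary effects of fixed-width bins; the bandwidth plays the role of the free parameter.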
Besides constitutive binding sites for a given TF, T-KDE can identify genomic hot spots where several different proteins bind and, conversely, cell-specific sites bound by a given protein. Using 116 CTCF ChIP-seq datasets as an example, we showed that T-KDE is relatively robust to the choice of the free parameter and is highly accurate when compared to the identification of constitutive binding sites through motif analysis.

We also have several long-standing collaborations with intramural investigators. Specifically:
a) Identifying differentially expressed genes in wild-type Zfp36l3 and Zfp36l3 knockout (KO) mouse placentas using Affymetrix and Agilent arrays and deep sequencing (mRNA-seq) (PI Blackshear).
b) Identifying Zfp36l3 targets by RNA-seq analysis (PI Blackshear).
c) Role of Med13 in embryo development (PI Williams).
d) Genome-wide tamoxifen-induced ER alpha binding specificity (PI Korach).
|Li, Yuanyuan; Umbach, David M; Li, Leping (2014) T-KDE: a method for genome-wide identification of constitutive protein binding sites from multiple ChIP-seq data sets. BMC Genomics 15:27|
|Li, Yin; Hamilton, Katherine J; Lai, Anne Y et al. (2014) Diethylstilbestrol (DES)-stimulated hormonal toxicity is mediated by ERα alteration of target gene methylation patterns and epigenetic modifiers (DNMT3A, MBD2, and HDAC2) in the mouse seminal vesicle. Environ Health Perspect 122:262-8|
|Hewitt, Sylvia C; Li, Leping; Grimm, Sara A et al. (2014) Novel DNA motif binding activity observed in vivo with an estrogen receptor α mutant mouse. Mol Endocrinol 28:899-911|
|Li, Yuanyuan; Huang, Weichun; Niu, Liang et al. (2013) Characterization of constitutive CTCF/cohesin loci: a possible role in establishing topological domains in mammalian genomes. BMC Genomics 14:553|
|Li, Leping (2009) GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol 16:317-29|
|Dowd, T L; Li, L; Gundberg, C M (2008) The (1)H NMR structure of bovine Pb(2+)-osteocalcin and implications for lead toxicity. Biochim Biophys Acta 1784:1534-45|
|Lin, Rongheng; Dai, Shuangshuang; Irwin, Richard D et al. (2008) Gene set enrichment analysis for non-monotone association and multiple experimental categories. BMC Bioinformatics 9:481|