My main focus has been on methods for identifying transcription factor binding sites in sequences. For a transcription factor with a set of experimentally verified binding motif sequences, a position weight matrix (PWM) may be constructed by aligning the known sequences and calculating, at each position, the proportion of sequences in which each base (A, C, G, or T) occurs. Once a PWM is constructed, it can be used to scan sequences for putative binding sites: a sliding window as wide as the PWM is moved along the sequence, and each segment in the window is scored for how well it matches the PWM. A site is declared when the score passes a predefined cutoff. While this approach has provided useful hits to experimental investigators, one practical problem is that the false positive rate is often high, because short motifs can easily be found by chance in long sequences. The commonly used PWMs assume that the positions within a motif are mutually independent, i.e., a motif sequence follows a product of multinomial distributions. Thus, the observed frequencies of A, C, G, and T in each column are the maximum likelihood (ML) estimates of the multinomial parameters for that column, regardless of the contents of nearby columns. Furthermore, the number of known instances of a transcription factor binding site in public databases such as TRANSFAC is typically small. The ML estimates may therefore be poor, as the estimators are vulnerable to overfitting when based on insufficient data, and the resultant PWM models may be ineffective in distinguishing a true motif from a random segment. A further complication arises from the choice of cutoff for declaring a site to be a motif. A less stringent cutoff results in a large number of false positives, whereas a more stringent cutoff eliminates true positives.

We have updated the GADEM software with substantial improvements and additional features.
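To make the PWM construction and sliding-window scan described in this section concrete, here is a minimal Python sketch. It is an illustration only and not GADEM's implementation. The function names `build_pwm` and `scan`, the example sites, the uniform background, and the cutoff are all hypothetical, and the pseudocount is an added assumption: plain ML proportions would give zero probabilities (and undefined log scores) for bases never seen in a column, which is one face of the small-sample problem noted above.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=0.5):
    """Column-wise base proportions from aligned, equal-length sites.

    The pseudocount is an assumption of this sketch; with pseudocount=0
    the values are the plain ML (observed-frequency) estimates.
    """
    width = len(sites[0])
    pwm = []
    for j in range(width):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in BASES})
    return pwm

def scan(sequence, pwm, background=None, cutoff=0.0):
    """Slide a PWM-wide window along the sequence.

    Returns (start, score) for every window whose log2-odds score
    against the background reaches the cutoff. Columns are treated as
    independent, so the score is a sum over positions.
    """
    background = background or {b: 0.25 for b in BASES}
    width = len(pwm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(math.log2(pwm[j][c] / background[c])
                    for j, c in enumerate(window))
        if score >= cutoff:
            hits.append((i, score))
    return hits

# Hypothetical aligned binding sites and target sequence.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"]
pwm = build_pwm(sites)
hits = scan("CCCCTGACTCACCCC", pwm, cutoff=5.0)
print(hits)
```

Lowering `cutoff` in this sketch admits more windows and raising it drops weaker matches, which mirrors the trade-off between false positives and lost true positives. The sketch handles only the characters A, C, G, and T; ambiguous bases such as N would need separate treatment.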
We added a seeded analysis in which a user-specified position weight matrix (PWM) is the starting PWM model. Seeded analyses are at least 10x faster and perhaps more accurate than the already scalable unseeded analyses, and can identify short and less abundant motifs as well as variants of dominant motifs. We propose an approach for estimating the number of binding sites in the data, include non-uniform motif priors that take advantage of the high spatial resolution of ChIP-seq data, and support higher-order Markov background models. Finally, GADEM now reports each motif's fold enrichment in the input data relative to background or random sequence data.

In collaboration with B. Hoffman, G. Robertson, P. Hoodless and S. Jones at the British Columbia Cancer Agency (Vancouver, Canada), I have analyzed five large ChIP-seq datasets of 8,000 to 14,000 sequences each (FoxA2 in adult and E14.5 mouse liver and in pancreatic islets; Hnf4a in adult mouse liver; Pdx1 in pancreatic islets). Binding sites for the three transcription factors and their cofactors have been identified in these datasets. The transcription factor data, together with the genome-wide H3K4me1 location profiles in these tissues, provide insights into tissue-specific regulation of gene expression.

Methods for identifying co-regulator motifs in ChIP-seq data. A typical ChIP-seq experiment profiles the genome-wide binding of a single transcription factor. It is known, however, that multiple transcription factors may work together to regulate gene expression in development and specification, whereas most existing methods for motif discovery consider only one motif at a time. We have developed a multi-component mixture framework to model the joint distribution of two motifs. Our method uses the expectation-maximization algorithm to numerically maximize the observed data likelihood with respect to the proportions and position weight matrices of the two motifs.
Using maximum likelihood estimates of these parameters, we compute posterior probabilities for any given sequence and classify it as containing motif 1 only, motif 2 only, both motifs, or neither (pure statistical noise). We tested our method on a set of 12,000 Hnf4a ChIP-seq peaks, each 400 bp long, from E14.5 mouse liver (Hoffman et al., Genome Res., 2010). We identified co-occurrence of Hnf4a with other transcription factors such as Hnf1, Foxa1/2, and Cebp in 10-30% of the peaks, suggesting that those factors may function as co-regulators of Hnf4a in regulating liver differentiation and function.

Next-generation sequencing-based mRNA-seq and ChIP-seq have been increasingly used for identifying genome-wide epigenetic changes. The new type and huge volume of data from these technologies, however, pose computational challenges unmet by existing methods. We describe a simple and effective hierarchical statistical framework for identifying epigenetic differences between samples. Our method allows one to identify genomic regions with differential epigenetic marks and to detect differentially expressed genes. The method has been used in analyzing histone ChIP-seq and mRNA-seq data.

Collaborative work: Identified estrogen response elements on IGF1 and STAT5A in mouse. The work suggests that IGF1 is regulated directly by estrogen, not through AP1 as suggested previously by others. Identified CTCF binding sites on the BCL6 gene. Several CTCF binding sites were found in the first intron of BCL6, where CpG islands co-localize. CTCF binding to those sites is blocked when the CpG islands are methylated. The work indicates that BCL6 expression is maintained during lymphomagenesis, in part, through elevated DNA methylation that prevents CTCF-mediated silencing of the BCL6 gene. Elucidated the role of DNA sequence features in RNA polymerase II pausing and nucleosome positioning in Drosophila genes.
I have identified several sequence elements/features, both expected and unexpected, that are associated with RNA pol II pausing and nucleosome positioning. Developed an integrated pipeline for genome-wide analysis of transcription factor binding sites from ChIP-seq data (with researchers at UBC). This pipeline employs my GADEM software as the motif discovery tool.
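The classification step of the two-motif mixture framework summarized earlier (assigning each sequence to one of the mixture components by posterior probability) can be sketched as follows. This is a simplified illustration under stated assumptions, not the published method: the four component labels, the function names `posteriors` and `classify`, and all numeric values are hypothetical, and the per-component log-likelihoods are taken as given rather than computed from PWMs and a background model. In a full analysis the mixing proportions and PWMs would first be estimated by the EM algorithm.

```python
import math

# Hypothetical labels for the four mixture components.
COMPONENTS = ("noise", "motif1", "motif2", "both")

def posteriors(log_liks, proportions):
    """Posterior probability of each component for one sequence.

    Applies Bayes' rule: posterior_k is proportional to
    proportion_k * likelihood_k. The computation is done in log space,
    subtracting the maximum before exponentiating, to avoid underflow.
    Proportions must be strictly positive.
    """
    log_joint = [math.log(p) + ll for p, ll in zip(proportions, log_liks)]
    peak = max(log_joint)
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

def classify(log_liks, proportions):
    """Return the most probable component label and all posteriors."""
    post = posteriors(log_liks, proportions)
    best = max(range(len(post)), key=post.__getitem__)
    return COMPONENTS[best], post

# Made-up mixing proportions and log-likelihoods of one sequence under
# each component, in the order given by COMPONENTS.
proportions = [0.4, 0.3, 0.1, 0.2]
log_liks = [-560.0, -552.0, -558.0, -550.5]
label, post = classify(log_liks, proportions)
print(label, [round(p, 3) for p in post])
```

The same posterior computation is the E-step of an EM iteration for a finite mixture, so a sketch like this could be reused inside the estimation loop; the M-step that updates the proportions and PWMs is omitted here.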