Methods for DNA Sequence Analysis Multiple transcription factors often work together to regulate gene expression. For instance, a group of liver-specific transcription factors: Foxa2, Hnf1a, Hnf4a, and Hnf6 form a core regulatory module by binding to their own and to each others promoters to regulate gene expression in liver. Increasingly, investigators are interested in searching for not only the binding sites for the protein that was immuneprecipitated but also the binding sites for its transcriptional co-regulators. To my knowledge, there are only a few other computational methods (e.g., cisModule and Gibbs Centroid Sampler) for identifying several transcription factor motifs in the same sequences simultaneously. In practice, one often searches for motifs one at a time and sequentially masks those that have already been found. Although this strategy seems to work in some instances, it is not ideal as the order of the masking (even probabilistically) may affect the results. The cisModule developed by Zhou and Wong nearly a decade ago can handle at most only a few hundred sequences. We developed a finite mixture model to identify and describe the co-occurrence of two motifs in a set of sequences. We refer to the motif of the protein that was immunoprecipitated as the primary motif and the potential co-regulator motif as the secondary motif. We model the joint distribution of these two motifs and background noise, where the likelihood function allows for the possibilities that each motif can be absent, present in the plus orientation, or present in the reverse complement orientation thus, a sequence can be in one of nine states with respect to the two motifs. We use the EM algorithm to numerically maximize the likelihood of the resulting finite mixture model with respect to each motifs position weight matrix and the proportions in each of the nine states. We compute the posterior probabilities that any given sequence contains either a single motif, both motifs simultaneously, or neither motif, and we classify each sequence into one of the nine states accordingly. Our mixture model provides the user with a wealth of information on the presence or absence of both motifs in each sequence and the overall of proportions of sequences containing none, one or two motifs. Motif discovery with a mixture model is not new. A simpler mixture model coupled with an EM algorithm was previously proposed for motif discovery by Bailey and Elkan and implemented in MEME+ software. Our method differs from MEME+ in several ways. First, the model in MEME+ considers background plus one motif at a time, whereas we model the joint distribution of the two motifs and background simultaneously. To our knowledge, our approach is one of the few in existence that considers the joint distribution of the two motifs within a sequence and estimates the proportion of sequences in each of nine states. We investigated the performance of our method using several simulated data scenarios. To make the ChIP-seq simulations realistic, we used background sequences randomly taken from the mouse genome. In all simulations, we seeded these sequences so that the primary motif was present in about 90% of the sequences whereas the co-factor motif was present in 10% to 50% of the sequences. Both long and conserved co-factors such as Hnf1a and short and AT-rich co-factors such as Foxa2 were considered. In both cases, the primary and cofactor motifs were successfully identified. We also tested our method on a real dataset involving 12,000 400-bp sequences from a ChIP-seq experiment targeting Foxa2 in mouse liver. We identified co-occurrence of Foxa2 with Hnf4a, Cebpa, E-box, Ap1/Maf or Sp1 motifs in 5-33% of these sequences. All of these secondary motifs are either known liver-specific transcription factors or factors known to play an important role in liver function. Methods for ChIP-seq/mRNA-seq Data Analysis The typical data generated by mRNA-seq and ChIP-seq are short sequence reads which are first mapped to the reference genome or cDNA transcripts. Increasingly, mRNA-seq and ChIP-seq are used to identify genomic loci or transcripts that differ in sequence read counts in two-sample comparative studies. ChIP-seq/mRNA-seq involves several steps such as sample preparation, amplification and sequencing, each of which can introduce variation in the results. It is important to distinguish truly significant changes from the variation from extraneous sources. To accomplish this goal, we developed a framework that employs our novel parsimony read count normalization methods, and three steps. First, we filter out background regions in all samples. Second, we check each non-background region for significant read count differences between the two samples using the exact rate ratio test. Third, we check each non-background region to see whether the ratio of the normalized read counts between the two samples exceeds those expected from biological replicates. Finally, we combine results from the second and third steps by taking the maximum P-value from those two tests for each region and use its value to determine whether the region shows important differences between the samples. Specifically, the first step is applied to all genomic regions to filter out any regions where read tags do not have significantly more counts than expected from random Poisson background noise. This test is done region by region for each sample. The expected Poisson rate of noisy reads is estimated as the ratio of the total number of mapped reads over the length of the reference genome. The second step in our sequence uses the exact rate ratio test and is applied only to those genomic regions retained after the first test as having counts that exceed background for at least one sample. This test determines whether the Poisson rate for read counts in a given genomic region differs between two samples. We construct an exact binomial test that compares the ratio of the observed Poisson rates to an estimated null value. We estimate the null value because different samples may differ in read coverage through differences in sequencing effort (e.g., number of lanes) or sequence quality effects which could contradict the default null value of 1. We describe several procedures for estimating the null value using all or partial mapped read counts in our article published in Nucleic Acids Research. The third step in our sequence is applied to the same non-background regions as in the second step and is designed to find the genomic regions whose relative fold change in read counts between the two samples is larger than one would expect by chance under a normal assumption. The second test determines which regions have different Poisson rates between two samples. Simply comparing p-values from that test across genomic regions does not work well because genomic regions with high depth of coverage will have smaller p-values than regions of low depth of coverage even if ratio of Poisson rates is the same for both regions. To enable direct comparison of changes across genomic regions, we take as data the log2 ratios of read counts in the two conditions and construct a Z-test by assuming a Gaussian distribution for the log2 ratio. The null distribution of the log2 ratio comes can be estimated from biological replicates or estimated from the regions declared non-differentially changed by our exact rate ratio test if no biological replicates are available. We demonstrated the effectiveness and the distinct features of our method on two published ChIP-seq and mRNA-seq datasets. We have also applied it to several datasets from our collaborators, some of which are described in the next Section.

Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Zip Code
Miao, Yi-Liang; Gambini, Andrés; Zhang, Yingpei et al. (2018) Mediator complex component MED13 regulates zygotic genome activation and is required for postimplantation development in the mouse. Biol Reprod 98:449-464
Nguyen, Thuy-Ai T; Grimm, Sara A; Bushel, Pierre R et al. (2018) Revealing a human p53 universe. Nucleic Acids Res :
Ungewitter, Erica K; Rotgers, Emmi; Kang, Hong Soon et al. (2018) Loss of Glis3 causes dysregulation of retrotransposon silencing and germ cell demise in fetal mouse testis. Sci Rep 8:9662
Roy, Sumedha; Moore, Amanda J; Love, Cassandra et al. (2018) Id Proteins Suppress E2A-Driven Invariant Natural Killer T Cell Development prior to TCR Selection. Front Immunol 9:42
Li, Yuanyuan; Umbach, David M; Li, Leping (2017) Putative genomic characteristics of BRAF V600K versus V600E cutaneous melanoma. Melanoma Res 27:527-535
Fan, Zheng; Ahn, Mihye; Roth, Heidi L et al. (2017) Sleep Apnea and Hypoventilation in Patients with Down Syndrome: Analysis of 144 Polysomnogram Studies. Children (Basel) 4:
Ren, Natalie S X; Ji, Ming; Tokar, Erik J et al. (2017) Haploinsufficiency of SIRT1 Enhances Glutamine Metabolism and Promotes Cancer Development. Curr Biol 27:483-494
Li, Yuanyuan; Kang, Kai; Krahn, Juno M et al. (2017) A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genomics 18:508
Lowe, Julie M; Nguyen, Thuy-Ai; Grimm, Sara A et al. (2017) The novel p53 target TNFAIP8 variant 2 is increased in cancer and offsets p53-dependent tumor suppression. Cell Death Differ 24:181-191
Stumpo, Deborah J; Trempus, Carol S; Tucker, Charles J et al. (2016) Deficiency of the placenta- and yolk sac-specific tristetraprolin family member ZFP36L3 identifies likely mRNA targets and an unexpected link to placental iron metabolism. Development 143:1424-33

Showing the most recent 10 out of 36 publications