Bioinformatics

Li, Leping

Abstract

Multiple transcription factors often work together to regulate gene expression. For instance, a group of liver-specific transcription factors: Foxa2, Hnf1a, Hnf4a, and Hnf6 form a core regulatory module by binding to their own and to each others promoters to regulate gene expression in liver. Increasingly, investigators are interested in searching for not only the binding sites for the protein that was immuneprecipitated but also the binding sites for its transcriptional co-regulators. To my knowledge, there are only a few other computational methods (e.g., cisModule and Gibbs Centroid Sampler) for identifying several transcription factor motifs in the same sequences simultaneously. In practice, one often searches for motifs one-at-a-time and sequentially masks those that are already found. Although this strategy seems to work in some instances, it is not ideal as the order of the masking (even probabilistically) may affect the results. The cisModule developed by Zhou and Wong nearly a decade ago can handle at most only a few hundred sequences. We proposed a finite mixture model to describe the co-occurrence of two motifs in a set of sequences (Manuscript in review). We refer to the motif of the protein that was immunoprecipitated as the primary motif and one potential co-regulator motif as the secondary motif. We model the joint distribution of these two motifs and background noise, where the likelihood function allows for the possibilities that each motif can be absent, present in the plus orientation, or present in the reverse complement orientation thus, a sequence can be in one of nine states with respect to the two motifs. We use the EM algorithm to numerically maximize the likelihood of the resulting finite mixture model with respect to each motifs position weight matrix and the proportions in each of the nine states. We compute the posterior probabilities that any given sequence contains either a single motif, both motifs simultaneously, or neither motif, and we classify each sequence into one of the nine states accordingly. Our mixture model provides a user with a wealth of information on the presence or absence of both motifs in each sequence and the overall of proportions of sequences containing none, one or two motifs. Motif discovery with a mixture model is not new. A simpler mixture model coupled with an EM algorithm was previously proposed for motif discovery by Bailey and Elkan and implemented in MEME+ software. Our method differs from MEME+ in several ways. First, the model in MEME+ considers background plus one motif at a time, whereas we model the joint distribution of the two motifs and background simultaneously. To our knowledge, our approach is one of the few in existence that considers the joint distribution of the two motifs within a sequence and estimates the proportion of sequences in each of nine states. We investigated the performance of our method on several simulated data sets. To make the ChIP-seq simulations realistic, we used background sequences randomly taken from the mouse genome. In all simulations, we seeded these sequences so that the primary motif was present in about 90% of the sequences whereas the co-factor motif was present in 10% to 50% of the sequences. Both long and conserved co-factors such as Hnf1a and short and AT-rich co-factors such as Foxa2 were considered. In both cases, the primary and cofactor motifs were successfully identified. We also tested our method on a real dataset involving 12,000 400-bp sequences from a ChIP-seq experiment targeting Foxa2 in mouse liver. We identified co-occurrence of Foxa2 with Hnf4a, Cebpa, E-box, Ap1/Maf or Sp1 motifs in 5-33% of these sequences. All of these secondary motifs are either known liver-specific transcription factors or factors known to play an important role in liver function. The coMOTIF software is available for free download at: www.niehs.nih.gov/research/resources/software/comotif/index.cfm ChIP-seq/mRNA-seq involves several steps such as sample preparation, amplification and sequencing, each of which can introduce variation in the result. It is important to detect truly significant changes rather than variation from the extraneous sources. To accomplish this goal, we developed a framework that relies on a sequence of three tests. The first test is performed on individual samples to identify genomic regions whose mapped read counts differ significantly from the background. For those regions that pass the first test, we then test if the un-normalized read counts in each of the genomic regions differ between two samples. Finally, we test whether the ratio of the normalized read counts between the two samples exceeds normal variation. The maximal p-value from the last two tests is used as the criterion for testing the differential change. Specifically, the first test is applied to all genomic regions to filter out any regions where read tags do not have significantly more counts than expected from random Poisson background noise. This test is done region by region for each sample. The expected Poisson rate of noisy reads is estimated as the ratio of the total number of mapped reads over the length of the reference genome. The second test in our sequence is the exact rate ratio test and is applied only to those genomic regions retained after the first test as having counts that exceed background for at least one sample. This test determines whether the Poisson rate for read counts in a given genomic region differs between two samples. We construct an exact binomial test to test the null hypothesis that the ratio of the observed Poisson rates is the same as a null value. We describe several procedures for estimating the null value using all or partial mapped read counts. The third test in our sequence is applied only to those regions declared differentially changed by the second test. The third test is designed to find the genomic regions whose relative fold change in read counts between the two samples is larger than one would expect by chance under a normal assumption. The second test determines which regions have different Poisson rates between two samples. Simply comparing p-values from that test across genomic regions does not work well because genomic regions with high depth of coverage will have smaller p-values than regions of low depth of coverage even if ratio of Poisson rates is the same for both regions. To enable direct comparison of changes across genomic regions, we take as data the log2 ratios of read counts in the two conditions and construct a Z-test by assuming a Gaussian distribution for the log2 ratio. The null distribution of the log2 ratio comes from the regions that were declared non-differentially changed by the second test (exact rate ratio test). We demonstrated the effectiveness and the distinct features of our method on two published ChIP-seq and mRNA-seq datasets. Collaborative work: 1) Epigenetic regulation of developmental epithelial-mesenchymal transition in trophoblast stem cells 2) Identifying DNA sequence elements that are implicated in RNA Pol II pausing and nucleosome positioning in Drosophila genes

Funding Agency

Agency: National Institute of Health (NIH)
Institute: National Institute of Environmental Health Sciences (NIEHS)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIAES101765-08
Application #: 8336626
Study Section

Project Start
Project End
Budget Start
Budget End
Support Year: 8
Fiscal Year: 2011
Total Cost: $934,286
Indirect Cost

Institution

Name: National Institute of Environmental Health Sciences
Department
Type
DUNS #

City
State
Country
Zip Code

Related projects

Publications

Miao, Yi-Liang; Gambini, Andrés; Zhang, Yingpei et al. (2018) Mediator complex component MED13 regulates zygotic genome activation and is required for postimplantation development in the mouse. Biol Reprod 98:449-464

Nguyen, Thuy-Ai T; Grimm, Sara A; Bushel, Pierre R et al. (2018) Revealing a human p53 universe. Nucleic Acids Res :

Ungewitter, Erica K; Rotgers, Emmi; Kang, Hong Soon et al. (2018) Loss of Glis3 causes dysregulation of retrotransposon silencing and germ cell demise in fetal mouse testis. Sci Rep 8:9662

Roy, Sumedha; Moore, Amanda J; Love, Cassandra et al. (2018) Id Proteins Suppress E2A-Driven Invariant Natural Killer T Cell Development prior to TCR Selection. Front Immunol 9:42

Li, Yuanyuan; Umbach, David M; Li, Leping (2017) Putative genomic characteristics of BRAF V600K versus V600E cutaneous melanoma. Melanoma Res 27:527-535

Fan, Zheng; Ahn, Mihye; Roth, Heidi L et al. (2017) Sleep Apnea and Hypoventilation in Patients with Down Syndrome: Analysis of 144 Polysomnogram Studies. Children (Basel) 4:

Ren, Natalie S X; Ji, Ming; Tokar, Erik J et al. (2017) Haploinsufficiency of SIRT1 Enhances Glutamine Metabolism and Promotes Cancer Development. Curr Biol 27:483-494

Li, Yuanyuan; Kang, Kai; Krahn, Juno M et al. (2017) A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genomics 18:508

Lowe, Julie M; Nguyen, Thuy-Ai; Grimm, Sara A et al. (2017) The novel p53 target TNFAIP8 variant 2 is increased in cancer and offsets p53-dependent tumor suppression. Cell Death Differ 24:181-191

Stumpo, Deborah J; Trempus, Carol S; Tucker, Charles J et al. (2016) Deficiency of the placenta- and yolk sac-specific tristetraprolin family member ZFP36L3 identifies likely mRNA targets and an unexpected link to placental iron metabolism. Development 143:1424-33

Showing the most recent 10 out of 36 publications

Comments

Be the first to comment on this grant

Recent in Grantomics:

Recently viewed grants:

Recently added grants: