Bioinformatics

Li, Leping

Abstract

My main focus has been on methods for identifying transcription factor binding sites in sequences. For a transcription factor with a set of experimentally verified binding motif sequences, a position weight matrix (PWM) may be constructed by aligning the known sequences and calculating, at each position, the proportion of sequences in which each base (A, C, G, or T) occurs. Once a PWM is constructed, it can be used to scan sequences for putative binding sites: a sliding window of the same length as the PWM scores how well each sequence segment matches the PWM, and a site is declared when the score passes a predefined cutoff. While this approach has provided useful hits to experimental investigators, one practical problem is that the false positive rate is often high, because short motifs can easily be found by chance in long sequences. The commonly used PWMs assume that the positions within a motif are mutually independent, i.e., a motif sequence follows a product of multinomial distributions. Thus, the observed frequencies of A, C, G, and T in each column are the maximum likelihood (ML) estimates of the multinomial parameters for that column, regardless of the contents of nearby columns. Furthermore, the number of known instances of a transcription factor binding site in public databases such as TRANSFAC is typically small. The ML estimates may be poor, as the estimators are vulnerable to overfitting when based on insufficient data. The resultant PWM models may be ineffective in distinguishing a true motif from a random segment. A further complication arises from the choice of cutoff for declaring a site to be a motif. A less stringent cutoff results in a large number of false positives, whereas a more stringent cutoff eliminates true positives.

fdrMotif: Identifying cis-elements by an EM algorithm coupled with false discovery rate control
Most de novo motif identification methods optimize the motif model first and then separately test the statistical significance of the motif score. In the first stage, a motif abundance parameter needs to be specified or modeled. In the second stage, a z-score or p-value is used as the test statistic. Error rates under multiple comparisons are not fully considered. We propose a simple but novel approach, fdrMotif, that selects as many binding sites as possible while controlling a user-specified false discovery rate (FDR), defined as the expected proportion of declared binding sites that are in fact non-motif subsequences. Unlike existing iterative methods, fdrMotif combines model optimization (e.g., of a position weight matrix (PWM)) and significance testing at each step. fdrMotif estimates a high-order Markov model from the original sequence data and uses it to generate many sets of simulated background sequences. By monitoring the proportion of binding sites selected in these background sequences, fdrMotif controls the FDR in the original data. The model is then updated using an expectation (E)/maximization (M) procedure. We propose a new normalization procedure in the E-step for updating the model. This process is repeated until either the model converges or the number of iterations exceeds a maximum. fdrMotif can take multiple PWMs as the starting estimates for the EM algorithm and automatically runs them one at a time to ensure uniqueness of the solution.

Simulation studies suggest that our normalization procedure assigns larger weights to the binding sites than do two other commonly used normalization procedures. Furthermore, fdrMotif requires only a user-specified FDR and an initial PWM. When tested on sequences containing 542 high confidence experimental p53 binding loci, fdrMotif identified 569 p53 binding sites in 505 (93.2%) sequences. In comparison, MEME identified more binding sites but in fewer ChIP sequences than fdrMotif.
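The FDR-controlled selection step described above can be illustrated with a minimal Python sketch. It is not the fdrMotif implementation: the function names, the Markov order of 3, the uniform fallback for unseen contexts, and the estimator (mean number of hits per simulated background set divided by the number of hits in the real data) are all assumptions made for illustration, and the EM update of the PWM between selections is omitted.

```python
import random
from collections import Counter, defaultdict


def fit_markov(sequences, order=3):
    """Count context -> next-base transitions (order 3 is an assumed choice)."""
    if order < 1:
        raise ValueError("order must be at least 1")
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def simulate(counts, length, order, rng):
    """Draw one background sequence from the fitted Markov model."""
    if not counts:
        raise ValueError("model has no observed contexts")
    seq = list(rng.choice(sorted(counts)))  # start from an observed context
    while len(seq) < length:
        nxt = counts.get("".join(seq[-order:]))
        if nxt:
            bases, weights = zip(*nxt.items())
            seq.append(rng.choices(bases, weights=weights)[0])
        else:
            # Assumption: unseen contexts fall back to uniform base choice.
            seq.append(rng.choice("ACGT"))
    return "".join(seq[:length])


def cutoff_for_fdr(real_scores, background_sets, target_fdr):
    """Return the lowest cutoff whose estimated FDR is <= target_fdr.

    Assumed estimator: mean hits per background set / hits in real data.
    Returns None if no cutoff meets the target.
    """
    if not background_sets:
        raise ValueError("need at least one simulated background set")
    best = None
    for cutoff in sorted(set(real_scores), reverse=True):
        real_hits = sum(s >= cutoff for s in real_scores)
        false_hits = sum(
            sum(s >= cutoff for s in bg) for bg in background_sets
        ) / len(background_sets)
        if false_hits / real_hits <= target_fdr:
            best = cutoff  # keep lowering the cutoff while the target holds
    return best
```

In use, each window of the real sequences and of every simulated set would be scored against the current PWM, and the returned cutoff would define the sites used for the next model update.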
When tested on 500 sets of simulated ChIP sequences with embedded known p53 binding sites, fdrMotif had higher sensitivity than MEME with similar positive predictive value. Furthermore, fdrMotif is robust to noise: it selected nearly identical binding sites in the unadulterated data and in data adulterated with 50% added background sequences. We suggest that fdrMotif represents an improvement over MEME.

Collaborative research in sequence analysis

Oct4/Sox2 transactivates pluripotency-associated cell cycle regulatory microRNAs in human embryonic stem cells

Oct4, Sox2, and Nanog are transcription factors required for pluripotency during early embryogenesis and maintenance of embryonic stem cell (ESC) identity. Archer's lab has been interested in understanding the roles of these transcription factors in pluripotency. I have been collaborating with Archer on identifying Oct4/Sox2 and Nanog transcriptional target genes. I carried out computational analysis of the promoters of all known human genes for Oct4/Sox2 binding sites. Many conserved putative Oct4/Sox2 binding sites were identified, and Oct4 binding to some of the predicted sites was confirmed by ChIP experiments carried out by Archer's lab. Among the predicted targets, we decided to focus on a microRNA cluster that consists of eight microRNAs (mir-302a-d, mir-302a*-c* and mir-367) on chromosome 4. Mir-302 is highly expressed in ESCs. I identified putative Oct4, Sox2, Nanog, and Stat3 binding sites in the promoter region of the mir-302 cluster using position weight matrix analyses. Gel shift and ChIP experiments carried out by Archer's lab confirmed that Oct4 was bound at the predicted site. Archer's lab showed that expression of the primary transcript of the mir-302 cluster is dependent on Oct4 and Sox2 in human ESCs, and that its expression pattern also parallels Oct4 expression during embryogenesis.

Sequence analysis of genes with promoter-proximally stalled Pol II
Recently, Adelman's lab performed a genome-wide analysis in Drosophila and identified approximately 1000 genes with promoter-proximally stalled Pol II. The Pol II stalled genes respond to environmental or developmental stimuli, suggesting that the rapid release of stalled Pol II facilitates efficient responses to the changing environment. To identify enriched motifs in the proximal promoter regions of the stalled genes, I analyzed two sequence regions of these genes (-1kb to +200bp and -200bp to +500bp, both relative to the transcription start site) using (1) existing tools such as MEME and (2) tools developed in my group, such as fdrMotif (Section II.4) and a newly developed motif identification tool. In addition, I compared motif abundances in these sequences with those in the same regions of all known Drosophila genes, using the position weight matrices (PWMs) in the TRANSFAC database as the motif models. Several over-represented motifs were independently identified, including the GAGA factor (GAF) motif. The logo plot of the 600 binding sites identified in the -1kb to +200bp sequences of 1000 genes is shown below. ChIP analysis carried out in Adelman's lab confirmed that GAF was bound to 22 of the 24 selected predicted targets. GAF is encoded by the Trithorax-like (Trl) gene, which has been demonstrated to be essential in the regulation of multiple developmental proteins. GAF protein has been linked to modifications of chromatin structures.
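The PWM-based scanning that underlies these analyses (building a matrix from aligned known sites, then scoring every window against a cutoff) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the score is a log-odds ratio against a uniform 0.25 background, and the pseudocount of 0.5 is an added smoothing choice that departs from the plain ML column frequencies described in the abstract.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=0.5):
    """Per-position base proportions from aligned, equal-length sites.

    The pseudocount is an assumption added to avoid zero probabilities
    when only a few known sites are available.
    """
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("sites must be aligned to the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [s[pos] for s in sites]
        pwm.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return pwm


def log_odds(pwm, window, background=0.25):
    """Sum of per-position log2(p_motif / p_background); positions independent."""
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, window))


def scan(pwm, sequence, cutoff):
    """Slide a PWM-width window along the sequence and report hits >= cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = log_odds(pwm, window)
        if score >= cutoff:
            hits.append((start, window, score))
    return hits


# Toy aligned sites, invented for illustration only.
example_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
example_pwm = build_pwm(example_sites)
example_hits = scan(example_pwm, "CCCTGACTCAGGG", cutoff=8.0)
```

Lowering the cutoff in this sketch admits more windows, which is the trade-off between false positives and missed true sites that the abstract describes.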

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of Environmental Health Sciences (NIEHS)
Type: Intramural Research (Z01)
Project #: 1Z01ES101765-05
Application #: 7734537
Study Section

Project Start
Project End
Budget Start
Budget End
Support Year: 5
Fiscal Year: 2008
Total Cost: $365,942
Indirect Cost

Institution

Name: National Institute of Environmental Health Sciences
Department
Type
DUNS #

City
State
Country: United States
Zip Code

Related projects


NIH 2008 Z01 ES	Bioinformatics Li, Leping / National Institute of Environmental Health Sciences	$365,942
NIH 2007 Z01 ES	Bioinformatics Li, Leping / National Institute of Environmental Health Sciences	$769,823
NIH 2006 Z01 ES	Bioinformatics Li, Leping / U.S. National Inst of Environ Hlth Scis
NIH 2005 Z01 ES	Bioinformatics Li, Leping / U.S. National Inst of Environ Hlth Scis
NIH 2004 Z01 ES	Bioinformatics Li, Leping / U.S. National Inst of Environ Hlth Scis

Publications

Fan, Zheng; Ahn, Mihye; Roth, Heidi L et al. (2017) Sleep Apnea and Hypoventilation in Patients with Down Syndrome: Analysis of 144 Polysomnogram Studies. Children (Basel) 4:

Xu, Zongli; Niu, Liang; Li, Leping et al. (2016) ENmix: a novel background correction method for Illumina HumanMethylation450 BeadChip. Nucleic Acids Res 44:e20

Li, Yuanyuan; Krahn, Juno M; Flake, Gordon P et al. (2015) Toward predicting metastatic progression of melanoma based on gene expression data. Pigment Cell Melanoma Res 28:453-63

Zhang, Xiaoli; Li, Bing; Li, Wenguo et al. (2014) Transcriptional repression by the BRG1-SWI/SNF complex affects the pluripotency of human embryonic stem cells. Stem Cell Reports 3:460-74

Niu, Liang; Huang, Weichun; Umbach, David M et al. (2014) IUTA: a tool for effectively detecting differential isoform usage from RNA-Seq data. BMC Genomics 15:862

Huang, Weichun; Loganantharaj, Rasiah; Schroeder, Bryce et al. (2013) PAVIS: a tool for Peak Annotation and Visualization. Bioinformatics 29:3097-9

Huang, Weichun; Li, Leping; Myers, Jason R et al. (2012) ART: a next-generation sequencing read simulator. Bioinformatics 28:593-4

Abell, Amy N; Jordan, Nicole Vincent; Huang, Weichun et al. (2011) MAP3K4/CBP-regulated H2B acetylation controls epithelial-mesenchymal transition in trophoblast stem cells. Cell Stem Cell 8:525-37

Xu, Mengyuan; Weinberg, Clarice R; Umbach, David M et al. (2011) coMOTIF: a mixture framework for identifying transcription factor and a coregulator motif in ChIP-seq data. Bioinformatics 27:2625-32

Mercier, Eloi; Droit, Arnaud; Li, Leping et al. (2011) An integrated pipeline for the genome-wide analysis of transcription factor binding sites from ChIP-Seq. PLoS One 6:e16432

Showing the most recent 10 out of 29 publications
