The signal elements in promoter sequences are not well characterized. Around 2004, Dr. Mariño-Ramírez collected a database of about 4700 sequences surrounding the transcription start sites (TSSs) of human genes, and in 2008 he roughly doubled its size. We then developed tests based on maximal segment score statistics to find nucleotide words (generally of length 8) whose occurrences are localized relative to the TSS. About 80 of these words occurred in two or three clusters. By validating our results with microarray data and Gene Ontology information, we showed that the same 8-letter word can have two different biological functions, depending on its position relative to the TSS. Although positional dependence of sequence function is a known phenomenon, our study showed that it is widespread in the human genome. We implemented our methods, which combine positional information with theoretically sound Markov models, in the publicly available program A-GLAM, which finds transcription factor binding site (TFBS) motifs. Drs. Mariño-Ramírez and Spouge and Ms. Acevedo-Luna have extended the statistical methods for words to known TFBS motifs in the JASPAR database, in order to categorize JASPAR motifs according to their positional preference, to use positional preference to discover combinations of TFBSs in putative cis-regulatory modules, and to assign function from the Gene Ontology database. A manuscript in preparation demonstrates that position relative to the TSS influences the function of transcription factor binding motifs.

Drs. Kim, Jayatillake, and Spouge have also developed a model for calling peaks in ChIP-seq data, to identify TFBSs from experimental data, and have implemented it in a publicly available program (NEXT-peak). They are currently extending their work to allow for several peaks in a short sequence of DNA.
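The idea behind the positional word search can be illustrated with a minimal Python sketch. It is not the A-GLAM implementation. It assumes that all promoter sequences have equal length and are aligned so that the TSS sits at the same index (`tss_index`, a hypothetical parameter), and it scores each position as the observed word count minus a penalty of twice the uniform expectation (the `penalty_factor` default is an illustrative choice, made only so that the expected score per position is negative). The maximal segment is then found with Kadane's algorithm. Assessing the statistical significance of the resulting score is omitted.

```python
from collections import Counter


def word_positions(sequences, word):
    """Count, for each start index, how many times `word` begins there."""
    counts = Counter()
    for seq in sequences:
        start = seq.find(word)
        while start != -1:
            counts[start] += 1
            start = seq.find(word, start + 1)  # allow overlapping hits
    return counts


def maximal_segment(scores):
    """Kadane's algorithm: (best_sum, start, end) with `end` inclusive."""
    best_sum, best_start, best_end = scores[0], 0, 0
    run_sum, run_start = 0.0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:
            # A non-positive running sum cannot help: restart here.
            run_sum, run_start = score, i
        else:
            run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i
    return best_sum, best_start, best_end


def localized_segment(sequences, word, tss_index, penalty_factor=2.0):
    """Return (score, start, end) of the strongest positional cluster of
    `word`, with start/end given relative to the TSS, or None if absent.

    Assumption (illustrative only): per-position score is the observed
    count minus penalty_factor times the uniform expected count.
    """
    n_positions = len(sequences[0]) - len(word) + 1
    counts = word_positions(sequences, word)
    total = sum(counts.values())
    if total == 0:
        return None
    penalty = penalty_factor * total / n_positions
    scores = [counts.get(p, 0) - penalty for p in range(n_positions)]
    best_sum, start, end = maximal_segment(scores)
    return best_sum, start - tss_index, end - tss_index


if __name__ == "__main__":
    word = "ACGTACGT"
    # Toy data: 20-nt sequences with the TSS assumed at index 10.
    toy = ["T" * 5 + word + "T" * 7] * 3 + ["T" * 6 + word + "T" * 6] * 2
    print(localized_segment(toy, word, tss_index=10))
```

On the toy data in the `__main__` block, the word clusters at positions -5 to -4 relative to the assumed TSS. A word with two or three clusters could be handled by masking the best segment and repeating the search, although that extension is not shown here.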
Kim, Nak-Kyeong; Jayatillake, Rasika V; Spouge, John L (2013) NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data. BMC Genomics 14:349
Marino-Ramirez, Leonardo; Tharakaraman, Kannan; Spouge, John L et al. (2009) Promoter analysis: gene regulatory motif identification with A-GLAM. Methods Mol Biol 537:263-76
Kim, Nak-Kyeong; Tharakaraman, Kannan; Marino-Ramirez, Leonardo et al. (2008) Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 9:262
Tharakaraman, Kannan; Bodenreider, Olivier; Landsman, David et al. (2008) The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Res 36:2777-86