The signal elements in promoter sequences are not well characterized. Around 2004, Dr. Mariño-Ramírez collected a database of about 4700 sequences around the transcription start site (TSS) of human genes, and in 2008 he roughly doubled its size. We then developed tests based on maximal segment score statistics to find nucleotide words (generally of length 8) whose occurrences are localized relative to the TSS. About 80 of these words occurred in two or three clusters. By validating our results with microarray data and gene ontology information, we showed that the same 8-letter word could have two different biological functions, depending on its position with respect to the TSS. Although positional dependency of sequence function is now accepted, our study was one of the first to show that it is a widespread phenomenon in the human genome. We implemented our methods, which use positional information and theoretically sound Markov models, in the publicly available program A-GLAM, which was one of the first to find transcription factor binding site (TFBS) motifs using both sequence and position. Dr. Mariño-Ramírez has now increased the size of our database to 29,204 sequences. Dr. Spouge and Ms. Acevedo-Luna have extended the statistical methods from words to known TFBS motifs in the JASPAR database, in order to categorize JASPAR motifs according to their positional preference, to use positional preference to discover pairs of TFBSs in putative cis-regulatory modules, and to assign function from the Gene Ontology Database. Dr. Hansen has examined the results and compared them to the biological literature, finding several confirmed and several new transcription factor motifs with conserved positional preferences relative to the TSS.
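As a rough illustration of the idea of scoring how tightly a word is localized relative to the TSS, the sketch below counts where a word starts in a set of TSS-aligned sequences and then finds the contiguous run of positions with the highest total score. It is not A-GLAM and not the statistic used in the project: the function names, the per-position score (count minus a multiple of the mean count), and the `penalty` parameter are assumptions made only for this example, and no significance calculation is attempted.

```python
from typing import List, Tuple


def word_position_counts(seqs: List[str], word: str) -> List[int]:
    """Count how often `word` starts at each position.

    Assumes all sequences have the same length and are aligned on the TSS.
    """
    n_pos = len(seqs[0]) - len(word) + 1
    counts = [0] * n_pos
    for seq in seqs:
        for i in range(n_pos):
            if seq[i:i + len(word)] == word:
                counts[i] += 1
    return counts


def maximal_segment(scores: List[float]) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) of the highest-scoring contiguous run.

    `end` is exclusive. Returns (0.0, 0, 0) if no run has a positive sum.
    """
    best, best_start, best_end = 0.0, 0, 0
    running, run_start = 0.0, 0
    for i, s in enumerate(scores):
        if running <= 0:
            # A non-positive prefix cannot help; restart the run here.
            running, run_start = 0.0, i
        running += s
        if running > best:
            best, best_start, best_end = running, run_start, i + 1
    return best, best_start, best_end


def localized_window(seqs: List[str], word: str,
                     penalty: float = 2.0) -> Tuple[float, int, int]:
    """Score each position and return the maximal-scoring window.

    The score (hypothetical, for illustration) is the count at a position
    minus `penalty` times the mean count, so that an average position
    scores negatively and only clustered occurrences build a long run.
    """
    counts = word_position_counts(seqs, word)
    mean = sum(counts) / len(counts)
    scores = [c - penalty * mean for c in counts]
    return maximal_segment(scores)
```

Calling `localized_window(seqs, "ACGTACGT")` on equal-length, TSS-aligned sequences returns the window score together with the start and (exclusive) end positions of the best window. Assessing whether such a window is statistically significant would need a proper null model, which this sketch leaves out.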

Support Year: 13
Fiscal Year: 2016
Name: National Library of Medicine
Acevedo-Luna, Natalia; Mariño-Ramírez, Leonardo; Halbert, Armand et al. (2016) Most of the tight positional conservation of transcription factor binding sites near the transcription start site reflects their co-localization within regulatory modules. BMC Bioinformatics 17:479
Kim, Nak-Kyeong; Jayatillake, Rasika V; Spouge, John L (2013) NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data. BMC Genomics 14:349
Mariño-Ramírez, Leonardo; Tharakaraman, Kannan; Spouge, John L et al. (2009) Promoter analysis: gene regulatory motif identification with A-GLAM. Methods Mol Biol 537:263-76
Kim, Nak-Kyeong; Tharakaraman, Kannan; Mariño-Ramírez, Leonardo et al. (2008) Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 9:262
Tharakaraman, Kannan; Bodenreider, Olivier; Landsman, David et al. (2008) The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Res 36:2777-86