The signal elements in promoter sequences are not well characterized. Initially, Dr. Mariño-Ramírez collected a database of about 4700 sequences around the transcription start sites (TSSs) of human genes, later roughly doubling the size of the database. We then developed tests based on maximal segment score statistics to find nucleotide words (generally of length 8) whose occurrences are localized relative to the TSS. A-GLAM used these words as "seeds" for expansion to develop position-specific scoring matrices (PSSMs) characterizing systems of co-regulated genes. About 80 of these words occurred in two or three clusters. By validating our results with microarray data and gene ontology information, we showed that the same 8-letter word can have two different biological functions, depending on its position with respect to the TSS. Although positional dependency of function is a known phenomenon, our study showed that it is widespread in the human genome.

In addition, using gold-standard datasets and rigorous statistical tests, Drs. Spouge and Kim showed that Markov models and positional information significantly improve transcription factor binding site (TFBS) prediction, although not yet to practical accuracies. Moreover, they showed that the Markov models used in extant TFBS programs are inferior, both theoretically and practically, to the theoretically correct Markov model they proposed. Our publicly available program A-GLAM implements positional information and the theoretically sound Markov models to find TFBS motifs.

Tatiana Orlova (Volunteer, Jun-Jul 2009) and Narayan Perumal (Visitor, Jul 2009) collaborated in using A-GLAM to investigate possible TFBSs for the Toll-like receptors important to the immune response. Dr. Spouge and Ms. Acevedo-Luna are presently extending the statistical methods to the known TFBS motifs in the JASPAR database, to categorize motifs according to their positional preference, and to discover combinations of TFBSs in putative cis-regulatory modules by their positional preferences. Drs. Kim and Spouge have also developed a model for calling peaks in ChIP-seq data, to identify TFBSs from experimental data.
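
The idea of finding words that are localized relative to the TSS with a maximal segment score can be illustrated with a minimal sketch. This is not the project's actual statistic or the A-GLAM implementation: the function names (`word_positions`, `maximal_segment`, `localized_window`) are hypothetical, the per-position score (observed count minus a uniform baseline) is an assumption chosen for simplicity, and the sketch omits the significance test that a real analysis would need. It assumes all sequences have equal length and are aligned on the TSS.

```python
from typing import List, Tuple


def word_positions(sequences: List[str], word: str) -> List[int]:
    """Start offsets of every (possibly overlapping) occurrence of `word`."""
    hits = []
    for seq in sequences:
        start = seq.find(word)
        while start != -1:
            hits.append(start)
            start = seq.find(word, start + 1)
    return hits


def maximal_segment(scores: List[float]) -> Tuple[float, int, int]:
    """Highest-scoring contiguous segment as (score, start, end_inclusive)."""
    best_score, best_start, best_end = float("-inf"), 0, 0
    running, running_start = 0.0, 0
    for i, s in enumerate(scores):
        if running <= 0:
            # A non-positive prefix cannot help; start a new segment here.
            running, running_start = s, i
        else:
            running += s
        if running > best_score:
            best_score, best_start, best_end = running, running_start, i
    return best_score, best_start, best_end


def localized_window(sequences: List[str], word: str) -> Tuple[float, int, int]:
    """Window of start offsets where `word` is most enriched (illustrative)."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("sequences must be TSS-aligned and of equal length")
    n_positions = len(sequences[0]) - len(word) + 1
    counts = [0] * n_positions
    for pos in word_positions(sequences, word):
        counts[pos] += 1
    # Assumed scoring: observed count minus the count expected if
    # occurrences were spread uniformly over all positions.
    baseline = sum(counts) / n_positions
    scores = [c - baseline for c in counts]
    return maximal_segment(scores)
```

Calling `localized_window(sequences, "ACGTACGT")` on TSS-aligned sequences returns the segment score together with the first and last start offsets of the most enriched window; offsets are relative to the start of the aligned sequences, not to the TSS itself.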

Support Year: 9
Fiscal Year: 2012
Total Cost: $303,689
Institution Name: National Library of Medicine

Acevedo-Luna, Natalia; Mariño-Ramírez, Leonardo; Halbert, Armand et al. (2016) Most of the tight positional conservation of transcription factor binding sites near the transcription start site reflects their co-localization within regulatory modules. BMC Bioinformatics 17:479
Kim, Nak-Kyeong; Jayatillake, Rasika V; Spouge, John L (2013) NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data. BMC Genomics 14:349
Mariño-Ramírez, Leonardo; Tharakaraman, Kannan; Spouge, John L et al. (2009) Promoter analysis: gene regulatory motif identification with A-GLAM. Methods Mol Biol 537:263-76
Kim, Nak-Kyeong; Tharakaraman, Kannan; Mariño-Ramírez, Leonardo et al. (2008) Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites. BMC Bioinformatics 9:262
Tharakaraman, Kannan; Bodenreider, Olivier; Landsman, David et al. (2008) The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site. Nucleic Acids Res 36:2777-86