Computational analysis of various aspects of gene regulation, transcription factor binding in particular, is an important and well-known problem. Adequately addressed, it would greatly improve our understanding of diseases and help with the development of treatments. However, despite intensive efforts and the application of sophisticated models, the identification of binding motifs in DNA sequences remains elusive. The raw sequence likely carries only a part of the regulatory signal, and that signal is often too short and subtle to be detected even by the most sensitive algorithms. Many software tools developed for this purpose exploit the clustering and over-representation of motifs in promoter regions, sometimes combining this method with other experimental or phylogenetic information. However, the fact that many short sequences appear to be over-represented in any segment of DNA, at least in comparison with a completely random model, impedes reliable discovery.

This proposal seeks support to develop new software and apply it to the identification, visualization and analysis of repeated short (approximately 5-25 bases) degenerate motifs in short (a few hundred bases) and long (entire chromosomes) DNA sequences. We intend to use this software on the human and other genomes, as well as on sequences of interest to our collaborators in biology and chemistry, in an attempt to systematically characterize short over-represented sequences. We shall determine which of these motifs correspond to experimentally confirmed transcription factor binding consensuses, study their phylogenetic conservation and investigate their possible association with repeat families. Special attention will be paid to the upstream sequences of genes, and tools will be developed for a genome-wide search for related motif layouts.

Our software will be based on an adaptation of classic string processing algorithms to address inexact matches in a novel way, by combining seed elements into statistically significant degenerate motifs. In addition to performing analysis with our collaborators, we will place the programs in the public domain, along with the other tools which we have already developed and published, inviting other investigators to use them on their own data.
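
To make the notion of over-representation relative to a random model concrete, the following is a minimal illustrative sketch in Python. It is not the software described in this proposal: it counts exact k-mers only (no degenerate matching or seed combination), and the function names, the i.i.d. background model with base frequencies estimated from the sequence itself, and the thresholds `min_ratio` and `min_count` are all assumptions chosen for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(kmer, seq):
    """Expected number of occurrences of kmer under an i.i.d. model.

    Base frequencies are estimated from seq itself (an assumption;
    a real analysis might use a higher-order background model).
    """
    seq = seq.upper()
    n = len(seq)
    freqs = {b: seq.count(b) / n for b in "ACGT"}
    p = 1.0
    for b in kmer:
        p *= freqs[b]
    # Number of positions where a k-mer can start.
    return p * (n - len(kmer) + 1)


def overrepresented(seq, k, min_ratio=2.0, min_count=3):
    """Return (kmer, observed, expected) for k-mers seen far more often
    than the i.i.d. model predicts, sorted by observed/expected ratio."""
    hits = []
    for kmer, obs in kmer_counts(seq, k).items():
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing ambiguous bases such as N
        exp = expected_count(kmer, seq)
        if obs >= min_count and exp > 0 and obs / exp >= min_ratio:
            hits.append((kmer, obs, exp))
    return sorted(hits, key=lambda h: h[1] / h[2], reverse=True)
```

Under such a naive background, a large share of the k-mers in real genomic sequence tends to pass any fixed ratio threshold, which is the difficulty the abstract points to.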

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Small Research Grants (R03)
Project #
1R03LM009033-01A1
Application #
7196374
Study Section
Special Emphasis Panel (ZLM1-ZH-S (O1))
Program Officer
Ye, Jane
Project Start
2007-05-01
Project End
2009-04-30
Budget Start
2007-05-01
Budget End
2008-04-30
Support Year
1
Fiscal Year
2007
Total Cost
$74,000
Indirect Cost
Name
University of Texas Arlington
Department
Biostatistics & Other Math Sci
Type
Schools of Engineering
DUNS #
064234610
City
Arlington
State
TX
Country
United States
Zip Code
76019
Stojanovic, Nikola; Singh, Abanish (2010) Exploring motif composition of eukaryotic promoter regions. Adv Exp Med Biol 680:27-34
Singh, Abanish; Keswani, Umeshkumar; Levine, David et al. (2010) An algorithm for the reconstruction of consensus sequences of ancient segmental duplications and transposon copies in eukaryotic genomes. Int J Bioinform Res Appl 6:147-62
Stojanovic, Nikola (2009) A Study of the Distribution of Phylogenetically Conserved Blocks within Clusters of Mammalian Homeobox Genes. Genet Mol Biol 32:666-673
Singh, Abanish; Feschotte, Cedric; Stojanovic, Nikola (2007) A study of the repetitive structure and distribution of short motifs in human genomic sequences. Int J Bioinform Res Appl 3:523-35