This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

The characterization and interaction of transcriptional regulatory elements is fundamental to understanding how eukaryotic gene networks operate. Identification of the underlying transcription factors (TFs) and their target sequences is crucial to characterizing these regulatory elements. Many computational methods developed to locate TF binding sites (TFBSs) have relied on information from previously characterized sites. Yet the majority of TFs do not have binding site profiles, suggesting that these methods may not find uncharacterized sites. We propose a new approach to identify TFBSs in a set of unaligned sequences with no prior binding site information and with an emphasis on discovering new functional binding sites. Our method compiles positional weight matrices from a set of regulatory sequences taken from co-regulated or tissue-specific genes. These matrices are then used to find statistically over-represented motifs in the input sequences, relative to the rest of the organism's intergenic genome. Unlike other published TFBS discovery methods, our approach estimates the underlying probabilities by 'brute force', which requires substantial computational time. We have tested our approach on a set of well-characterized Drosophila developmental genes, and our results indicate that this method effectively predicts known binding sites and also identifies DNA regions that contain promising TFBS candidates. Although our results so far are very positive, we have not tested our method directly against other similar methods (e.g., WeederWeb) using identical data sets and metrics of success.
Therefore, we request TeraGrid computational time to perform a series of rigorous tests of our method using published data sets from other studies. We also plan to analyze a set of co-regulated genes from the sea squirt. Our code compiles with gcc and runs successfully on both Linux and PC platforms.
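
As an illustration only, the general idea described above (a positional weight matrix scored against input sequences, with over-representation judged against background sequence) can be sketched in a few lines of Python. This is not the project's code. It assumes the candidate sites are already aligned and of equal length, uses a uniform base background, and substitutes simple random resampling of background sequences for the brute-force probability estimation mentioned above. All function names, parameters, and thresholds here are hypothetical.

```python
import math
import random
from collections import Counter

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=None):
    """Build a log-odds weight matrix from equal-length, aligned sites.

    Assumption: a uniform background (0.25 per base) unless one is supplied.
    """
    background = background or {b: 0.25 for b in BASES}
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

def best_score(pwm, seq):
    """Return the highest matrix score over every window of seq."""
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )

def empirical_p_value(pwm, input_seqs, background_seqs, threshold,
                      trials=1000, seed=0):
    """Estimate how often random background samples contain at least as many
    motif-bearing sequences as the input set (resampling, an assumption)."""
    rng = random.Random(seed)
    observed = sum(best_score(pwm, s) >= threshold for s in input_seqs)
    # Score each background sequence once, then resample the hit flags.
    bg_hits = [best_score(pwm, s) >= threshold for s in background_seqs]
    n = len(input_seqs)  # must not exceed len(background_seqs)
    at_least = sum(
        sum(rng.sample(bg_hits, n)) >= observed for _ in range(trials)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (trials + 1)
```

Each resampling trial is independent of the others, so a computation of this shape divides naturally across many processors.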