The full annotation of an organism's genome requires the systematic identification of cis-regulatory sequences and the trans-acting factors that bind them. For all organisms, a significant remaining impediment to this goal is the limited number of transcription factors (TFs) with well-characterized DNA-binding specificities. We have developed a bacterial one-hybrid system that provides a rapid method to characterize the DNA-binding specificities of TFs. Using this technology, we have determined the specificity of approximately 15% (108 of ~750) of the predicted sequence-specific transcription factors in Drosophila melanogaster. This catalog of specificities includes proteins representing 12 different types of DNA-binding domains and all 84 independent homeodomain family members. To complement this dataset, we have developed computational tools that map the genomic distribution of TF binding site frequencies and use this information to identify putative cis-regulatory modules (CRMs) for any combination of TFs in our dataset. A web-based interface allows users to perform genome-wide searches for CRMs or to display binding site frequencies for individual TFs or combinations of TFs as tracks within the popular GBrowse interface. We now propose to characterize the DNA-binding specificity of all remaining D. melanogaster TFs, including all monomeric and homo-oligomeric TFs as well as all functional heterodimeric combinations from the basic leucine zipper and basic helix-loop-helix families. We will also refine our computational tools to improve their ability to distinguish CRMs within the genome, and we will integrate other data sources (e.g., ChIP-chip datasets) to improve CRM prediction. This effort will culminate in the development of a web-accessible database and search tools that will allow the scientific community to computationally identify putative CRMs that are regulated by any combination of factors of interest.
An outgrowth of our analysis will be genome-wide annotations of CRMs for subsets of factors that function in known transcriptional regulatory networks. To date, a complete description of TF specificities has not been obtained in any organism. Combined with improved computational tools and the extensive and growing body of experimental studies on D. melanogaster transcription, a catalog of TF specificities will allow the systematic annotation of CRMs throughout its genome. Once developed, these databases and tools should be directly applicable to the annotation of CRMs in other organisms, including humans.
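As an illustration of the kind of search described above (mapping binding site frequencies along a sequence and flagging regions where sites for a chosen combination of TFs co-occur), the following is a minimal sketch and not the project's actual software. All names (toy_pwm, find_sites, count_in_windows, candidate_crms), the match/mismatch scoring, and the window parameters are assumptions made for the example; real specificities would be weight matrices derived from the one-hybrid selections.

```python
def toy_pwm(consensus, match=1.0, mismatch=-1.0):
    """Build a toy scoring matrix from a consensus string (assumption:
    a real matrix would hold per-position log-odds from binding data)."""
    return [{b: (match if b == c else mismatch) for b in "ACGT"}
            for c in consensus]


def score_site(pwm, site):
    """Sum the per-position scores of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))


def find_sites(seq, pwm, threshold):
    """Return start positions whose score meets the threshold."""
    width = len(pwm)
    return [i for i in range(len(seq) - width + 1)
            if score_site(pwm, seq[i:i + width]) >= threshold]


def count_in_windows(positions, seq_len, window, step):
    """Count site starts falling in each sliding window (start, count)."""
    starts = range(0, max(seq_len - window, 0) + 1, step)
    return [(s, sum(1 for p in positions if s <= p < s + window))
            for s in starts]


def candidate_crms(seq, pwms, thresholds, window, step, min_sites=1):
    """Report windows in which every TF has at least min_sites sites."""
    per_tf = {
        name: dict(count_in_windows(
            find_sites(seq, pwm, thresholds[name]), len(seq), window, step))
        for name, pwm in pwms.items()
    }
    starts = range(0, max(len(seq) - window, 0) + 1, step)
    return [(s, s + window) for s in starts
            if all(counts[s] >= min_sites for counts in per_tf.values())]


# Hypothetical example: two made-up factors on a short synthetic sequence.
seq = "CCTAATGGTAATCC" + "G" * 30
pwms = {"tfA": toy_pwm("TAAT"), "tfB": toy_pwm("GGTA")}
thresholds = {"tfA": 4.0, "tfB": 4.0}  # exact consensus match only
print(candidate_crms(seq, pwms, thresholds, window=20, step=10))
```

This sketch scans one strand only and assigns a site to a window by its start position; a practical tool would also scan the reverse complement and would likely weight sites by score rather than counting exact matches.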
Although the genome project has extensively mapped which DNA sequences in humans and other organisms encode genes, mapping the regulatory regions that turn genes on and off has proven to be much more difficult. We will use newly developed experimental and computational tools to systematically map these control elements in an entire genome. This new genome "map" will help researchers understand how these elements function in normal cells and how mutations in these elements can lead to disease.