cis-Regulatory modules (enhancers) are genomic DNA fragments that contain multiple binding sites for sequence-specific DNA-binding transcription factors, which collectively control the temporal and spatial expression dynamics of flanking genes. While DNA sequence alignments between Drosophila melanogaster genes and their orthologous DNAs outside the genus are of limited use in identifying enhancers, the additive evolutionary divergence among 12 Drosophila species is of great utility for identifying functionally conserved sequences within enhancers. For example, all Drosophila enhancers characterized thus far contain multiple conserved sequence blocks (CSBs), made up of DNA-binding sites for known and as-yet-unidentified transcriptional regulators. Comparative genomic analysis among vertebrates also reveals that many of their enhancers contain CSBs. Recent studies have demonstrated that co-regulating enhancers share conserved sequence elements. We have developed computer algorithms to identify repeat sequences within CSB clusters (CSCs) and to search for co-regulating enhancers throughout the Drosophila genome based on their shared conserved sequence elements. Our genome-wide CSC database currently consists of over 100,000 CSCs obtained from evolutionary gene prints (EvoPrints) that span 90% of the Drosophila genome. Alignment-search algorithms were designed to scan this database to detect related enhancers by a multi-step protocol: first, conserved repeat elements within an input enhancer are identified; next, database CSCs that contain the same repeated sequences as the input CSC are identified; finally, via one-on-one alignments, these database CSCs are ranked by the sequence elements they share with the input enhancer.
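The multi-step search protocol described above can be illustrated with a minimal Python sketch. It is not the project's actual algorithm: the function names, the fixed element length `k`, and the representation of a CSC as a list of CSB strings are all assumptions made for illustration. In particular, exact k-mer matching stands in for the one-on-one alignment step, and reverse-complement matches are ignored.

```python
from collections import Counter


def kmers(csbs, k):
    """Yield every k-length substring of every CSB in one CSC."""
    for block in csbs:
        block = block.upper()
        for i in range(len(block) - k + 1):
            yield block[i:i + k]


def repeated_elements(csbs, k=6):
    """Step 1 (assumed form): elements occurring at least twice in the input CSC."""
    counts = Counter(kmers(csbs, k))
    return {element for element, n in counts.items() if n >= 2}


def rank_related_cscs(input_csbs, database, k=6):
    """Steps 2 and 3 (assumed form): select and rank database CSCs.

    `database` is a hypothetical dict mapping a CSC name to its list of CSBs.
    A database CSC is kept if it contains at least one repeat element of the
    input; kept CSCs are ranked by the number of distinct elements shared
    with the input (a simple proxy for an alignment score).
    """
    repeats = repeated_elements(input_csbs, k)
    input_elements = set(kmers(input_csbs, k))
    ranked = []
    for name, csbs in database.items():
        db_elements = set(kmers(csbs, k))
        if repeats & db_elements:  # step 2: shares a repeated element
            ranked.append((name, len(input_elements & db_elements)))
    # step 3: most shared elements first, ties broken by name
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

Calling `rank_related_cscs(input_csbs, database)` returns a list of `(name, shared_count)` pairs, highest first. A real implementation would presumably tolerate mismatches and search both strands; those refinements are omitted here to keep the three steps visible.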
This method has several advantages over previous enhancer discovery methods: 1) it makes no assumptions about the function of the conserved sequences (over 50% of the shared sequences do not represent DNA-binding sites for known transcription factors); 2) it requires no a priori knowledge of the functional elements in a given CSB cluster; and 3) it allows the user to focus on genes that are co-expressed in any given biological event, e.g., neural stem cell lineage development, to discover functionally related neuron identity genes via their co-regulating enhancers. We believe that the CSC database and search algorithms will become part of the next generation of tools for the discovery and analysis of Drosophila cis-regulatory DNA sequences. This methodology will also serve as a model for identifying functionally related enhancers in other model systems, such as mammalian cis-regulatory DNAs. To extend the use of these tools to mammalian cis-regulation, we are currently generating a mouse CSC database covering the entire mouse genome. CSCs parsed from 1.03 million EvoPrints that span the 2.6 billion base pair mouse genome, generated using either 400 million or 1.0 billion years of cumulative evolutionary divergence (CED), can be searched independently using the alignment algorithms developed for fly enhancer recognition. Analysis of a training set of 57 known mammalian enhancers reveals that all contain CSCs with CED values greater than 400 My, and most have ultra-conserved core sequences with CED values in excess of 1.0 By.
Kundu, Mukta; Kuzin, Alexander; Lin, Tzu-Yang et al. (2013) cis-regulatory complexity within a large non-coding region in the Drosophila genome. PLoS One 8:e60137