The LDSB investigates the organization and activities of developmental regulatory networks using formation of the Drosophila embryonic heart and body wall muscles as a model system. The overarching goal of this work is to comprehensively identify and characterize the upstream regulators of cell fate specification, the downstream effectors of differentiation, and the complex functional interactions that occur among these components during organogenesis. To achieve this objective, we combine contemporary genome-wide experimental and computational approaches with classical genetics and embryology to generate mechanistic hypotheses that we then test at single-cell resolution in the intact organism.

In collaboration with Dr. Ivan Ovcharenko's group at NCBI, we have been developing an unbiased view of mesodermal transcriptional regulatory codes using a machine learning algorithm to identify shared sequence features derived from a training set of functionally related enhancers. This strategy also identifies novel sequence motifs that may contribute to enhancer activity, thereby facilitating the characterization of additional transcription factors (TFs) that participate in a combinatorial manner within a particular gene regulatory network. The application of a machine learning approach requires a robust set of related enhancers as a starting point. We have previously validated the activities of 13 predicted muscle founder cell (FC) enhancers, but even when combined with examples from the literature, the total number available is too small to train a classifier without risking overfitting of the decision rules. Thus, we sought to increase the size of the training set by incorporating orthologous sequences from other Drosophila species. To this end, our collaborators developed, and we empirically validated, a phylogenetic profiling method that provided an additional 24 orthologous enhancers.
These cis-regulatory elements were then used to train a classifier that reliably distinguishes between FC enhancers and control sequences. Applying this classifier genome-wide, we predicted 5,500 candidate FC enhancers at a false-positive rate of 5%. Analysis of the positively weighted motifs identified by the classifier revealed that the motif signature of FC enhancers is complex and includes binding sites for known myogenic TFs. Numerous de novo motifs were also identified, thereby expanding the FC cis-regulatory code and suggesting novel TFs as candidate regulators of FC gene expression. In one case, we have undertaken both cis and trans experiments to confirm the function of a T-box family member as a key regulator of muscle FC identity.

We have applied a similar phylogenetic profiling and machine learning strategy to study Drosophila heart gene regulatory networks. Starting with 24 validated enhancers having activity in the cardiac mesoderm, cardial cells (CCs) and/or pericardial cells (PCs) of the mature heart, plus 26 orthologous sequences, we developed a classifier with high sensitivity and specificity for the training set. As with FC genes, we observed a strong correlation between top-scoring enhancer predictions and known cardiac gene expression. We also developed separate classifiers for PC- and CC-specific enhancers. Collectively, these analyses revealed that binding sites for established cardiac TFs, such as Tinman, Mef2 and Hand, are among the highly ranked motifs. Other sequence motifs were identified as putatively involved in cardiac regulatory codes, each of which is a plausible binding site for a TF controlling heart gene expression. Experiments are in progress to evaluate the functions of these candidate TF binding sites and to identify the corresponding trans-acting factors.
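The general workflow described above (learn sequence-feature weights from enhancers versus control sequences, then call genome-wide candidates at a fixed false-positive rate) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the k-mer frequency features, the difference-of-means weighting, and the function names `kmer_counts`, `train_weights`, `score` and `threshold_at_fpr` are hypothetical stand-ins, not the classifier or motif representation used in this project.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mean_frequencies(seqs, k):
    """Average per-sequence k-mer frequency over a set of sequences."""
    totals = Counter()
    for s in seqs:
        counts = kmer_counts(s, k)
        n = sum(counts.values())
        if n == 0:  # sequence shorter than k contributes nothing
            continue
        for kmer, c in counts.items():
            totals[kmer] += c / n
    return {kmer: v / len(seqs) for kmer, v in totals.items()}


def train_weights(enhancers, controls, k=3):
    """Toy linear model (assumption): weight = enrichment in enhancers.

    Positive weights mark k-mers that are more frequent in the enhancer
    training set than in the control sequences.
    """
    pos = mean_frequencies(enhancers, k)
    neg = mean_frequencies(controls, k)
    return {kmer: pos.get(kmer, 0.0) - neg.get(kmer, 0.0)
            for kmer in set(pos) | set(neg)}


def score(seq, weights, k=3):
    """Weighted sum of the k-mer frequencies of one sequence."""
    counts = kmer_counts(seq, k)
    n = max(sum(counts.values()), 1)
    return sum(weights.get(kmer, 0.0) * c / n for kmer, c in counts.items())


def threshold_at_fpr(control_scores, fpr=0.05):
    """Cutoff such that at most `fpr` of the controls score strictly above it."""
    ranked = sorted(control_scores, reverse=True)
    allowed = min(int(fpr * len(ranked)), len(ranked) - 1)
    return ranked[allowed]
```

A candidate window would then be called a predicted enhancer when `score(window, weights)` exceeds the cutoff returned by `threshold_at_fpr` on held-out control scores. Ranking the positive entries of `weights` is the toy analogue of inspecting the positively weighted motifs.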
Busser, Brian W; Taher, Leila; Kim, Yongsok et al. (2012) A machine learning approach for identifying novel cell type-specific transcriptional regulators of myogenesis. PLoS Genet 8:e1002531