The LDSB investigates the organization and activities of developmental regulatory networks using formation of the Drosophila embryonic heart and body wall muscles as a model system. The overarching goal of this work is to comprehensively identify and characterize the upstream regulators of cell fate specification, the downstream effectors of differentiation, and the complex functional interactions that occur among these components during organogenesis. To achieve this objective, we combine contemporary genome-wide experimental and computational approaches with classical genetics and embryology to generate mechanistic hypotheses that we then test at single-cell resolution in the intact organism. In collaboration with Dr. Ivan Ovcharenko's group at NCBI, we have been developing an unbiased approach for studying mesodermal transcriptional regulatory codes. Our approach involves applying a machine learning algorithm to identify shared features derived from a training set of functionally related enhancers and from relevant large-scale genomic datasets that we either generate ourselves or curate from the existing literature. Such shared features are multidimensional in nature and include the presence of various sequence motifs corresponding to known or predicted transcription factors (TFs), particular histone modifications associated with the enhancer training set, experimental evidence for in vivo TF localization at particular genomic sites, sets of co-expressed genes that may be subject to similar co-regulatory mechanisms, and potential relationships among known TF binding sites found within the enhancer training set, for example, the existence of fixed spacing intervals between functionally significant sites. Based on the initial enhancer classification, new enhancer predictions can be made on a genome-wide scale.
This computational strategy is also capable of identifying novel sequence motifs that may contribute to the specificity of enhancer activity, thereby facilitating the characterization of additional TFs that participate in a combinatorial manner within a particular gene regulatory network. These various computational predictions can be empirically tested in vivo using appropriate transgenic reporter assays, and the results of these experiments can be integrated to build a new and more precise classifier. The goal of this iterative approach is to generate successively more accurate regulatory models that determine the unique genetic programs of individual cell types during development. The application of a machine learning approach requires a robust set of related enhancers as a starting point. We have previously validated the activities of 16 predicted muscle founder cell (FC) enhancers, but even when combined with examples from the literature, the total available number is insufficient to train a classifier due to the risk of overfitting the decision rules. Thus, we sought to increase the size of the training set by incorporating orthologous sequences from other Drosophila species. To this end, our collaborators developed and we empirically validated a phylogenetic profiling method that provided an additional 24 orthologous FC enhancers. The predicted orthologs were found to be active in FCs when assayed in Drosophila melanogaster, although extensive evolutionary shuffling of key TF binding sites had occurred while overall myoblast-specific function was preserved. The original and orthologous cis-regulatory elements were then combined and used to train a classifier that reliably distinguishes between FC enhancers and control sequences. Applying this classifier on a genome-wide scale, we predicted 5,500 candidate FC enhancers at a false-positive rate of 5%.
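To make the notion of predicting enhancers at a fixed false-positive rate concrete, the following minimal sketch picks a score cutoff from control sequences and applies it to candidates. It is an illustration only, not the classifier used in this project: the function names, the assumption that each sequence already has a numeric classifier score, and all toy scores and candidate names are hypothetical.

```python
import math


def threshold_at_fpr(control_scores, max_fpr=0.05):
    """Return a cutoff such that the fraction of control scores
    strictly above it is at most max_fpr."""
    ranked = sorted(control_scores, reverse=True)
    # Number of controls allowed to score above the cutoff; the small
    # epsilon guards against floating-point rounding in the product.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    # At most `allowed` controls can lie strictly above this value.
    return ranked[allowed]


def call_enhancers(candidate_scores, cutoff):
    """Return the names of candidates scoring strictly above the cutoff."""
    return [name for name, score in candidate_scores.items() if score > cutoff]


# Hypothetical toy data: 20 control scores and 3 candidate regions.
controls = [i / 20 for i in range(20)]
candidates = {"cand_1": 0.97, "cand_2": 0.50, "cand_3": 0.92}

cutoff = threshold_at_fpr(controls, max_fpr=0.05)
predicted = call_enhancers(candidates, cutoff)
control_fpr = sum(s > cutoff for s in controls) / len(controls)
```

In this toy setting, exactly one of the 20 controls exceeds the cutoff, so the empirical false-positive rate on the controls equals the 5% target.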
Moreover, predicted FC enhancers were over-represented in proximity to known FC genes. Analysis of the positively weighted motifs identified by the classifier revealed that the motif signature of FC enhancers is complex and includes binding sites for known myogenic TFs. Numerous de novo motifs were also identified, thereby expanding putative members of the FC cis-regulatory code and suggesting novel TFs as candidate regulators of FC gene expression. In one such case, a combination of cis and trans in vivo testing led to the identification of Org-1, a novel T-box TF, as a previously unrecognized determinant of muscle FC identity. This conclusion was based on the specificity of the effect of Org-1 gain- and loss-of-function on both FC gene regulation and the development of particular muscles in which Org-1 is expressed, as well as on the effects of mutagenizing Org-1 binding sites in known muscle FC enhancers. Additional experiments in which motifs that were highly weighted by the FC enhancer classifier were mutagenized in transgenic reporter assays established that POU homeodomain, Myb, Ets and Forkhead domain binding sites also contribute to FC enhancer activity. Moreover, our analyses revealed an extraordinary degree of combinatorial specificity contributed by the TF binding sites found within a large set of validated FC enhancers. In particular, of 18 FC enhancers included in our studies, no two cis-regulatory elements contained the same set of 12 TF binding site classes, including 3 signal-activated, 1 ubiquitous, 4 tissue-restricted and 4 cell type-specific TF motifs. Collectively, these studies establish that TF binding site combinatorics make a major contribution to the diversity and functional complexity of enhancers having highly related but nonidentical activities in similar cell types within the developing embryo. We have applied a similar phylogenetic profiling and machine learning strategy to study Drosophila heart gene regulatory networks.
Starting with 24 validated enhancers having activity in the cardiac mesoderm, cardial cells (CCs) and/or pericardial cells (PCs) of the mature heart, plus 26 orthologous sequences, we developed a classifier having a high sensitivity and specificity for the training set. As for FC genes, we observed a strong correlation between top-scoring enhancer predictions and known cardiac gene expression. We also developed separate classifiers for PC- and CC-specific enhancers. Collectively, these analyses revealed that binding sites for established cardiac TFs, such as Tinman, Mef2 and Hand, are among the highly ranked motifs. Other sequence motifs were identified as putatively involved in cardiac regulatory codes, all of which are plausible binding sites for TFs controlling heart gene expression. Experiments are in progress to evaluate the functions of these candidate TF binding sites and to identify the corresponding trans-acting factors. Additional data types such as chromatin marks are also being incorporated into the classifier to improve its overall performance. In a related project, we are implementing machine learning methods to understand the regulatory mechanisms responsible for other subtypes of mesodermal cells for which the contributing factors are less well understood, an approach that is beginning to reveal combinations of TFs that govern unique cellular gene expression responses. Of interest, no single combination of established mesodermal TFs can account for the entire spectrum of gene expression in a population of mesodermal cells that had previously been considered to be relatively homogeneous in their identity, a finding that we are currently pursuing in more detail.
In summary, the present research strategy involving the integration of genetic, genomic and computational methods is revealing an unexpected degree of combinatorial complexity in the molecular mechanisms underlying the cell type-specific regulation of gene expression in a variety of mesodermal derivatives in the developing Drosophila embryo.
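The combinatorial observation reported above, that no two of the 18 FC enhancers studied contain the same set of TF binding site classes, corresponds to a simple check for duplicate motif-class signatures. The sketch below illustrates that check under stated assumptions: the enhancer names and their motif assignments are invented toy values, and the function name is hypothetical. It does not reproduce the project's data or analysis pipeline.

```python
from itertools import combinations


def shared_signatures(enhancers):
    """Return sorted pairs of enhancer names whose sets of TF binding
    site classes are identical. An empty list means every enhancer
    carries a unique combination."""
    signatures = {name: frozenset(classes) for name, classes in enhancers.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(signatures), 2)
        if signatures[a] == signatures[b]
    ]


# Hypothetical toy assignments of motif classes to enhancers.
unique_example = {
    "enh_A": {"Ets", "T-box"},
    "enh_B": {"Ets", "Forkhead"},
    "enh_C": {"Myb", "POU"},
}
duplicate_example = {
    "enh_A": {"Ets", "T-box"},
    "enh_B": {"Ets", "Forkhead"},
    "enh_C": {"T-box", "Ets"},  # same classes as enh_A, listed in another order
}

no_duplicates = shared_signatures(unique_example)
one_duplicate = shared_signatures(duplicate_example)
```

Representing each enhancer as a frozenset records only the presence or absence of each motif class, so site order, copy number and spacing are deliberately ignored in this simplified comparison.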

Support Year: 2
Fiscal Year: 2012
Total Cost: $742,347
Name: National Heart, Lung, and Blood Institute
Busser, Brian W; Taher, Leila; Kim, Yongsok et al. (2012) A machine learning approach for identifying novel cell type-specific transcriptional regulators of myogenesis. PLoS Genet 8:e1002531