The informatics core will continue to deliver high-quality, well-structured datasets with complete metadata to the community, along with comprehensive data analysis. To achieve this, we have developed bioinformatics pipelines to process and validate our ChIP-seq and RNA-seq data, and we have worked extensively with the ENCODE DCC to curate our metadata so that our data are easily accessible. The ChIP-seq pipeline has been used to call both narrow and broad peaks and to annotate HOT regions and TF binding sites in worm and fly across varying samples and stages. The RNA-seq pipeline has been used to identify genes that are differentially expressed under various conditions, such as different developmental stages and TF mutants, and we will evaluate the TF binding sites associated with these genes. Although these pipelines have been set up and tested thoroughly, we aim to optimize them further; for instance, we are developing a new method to call ChIP-seq peaks using multiple types of controls. To our knowledge, no existing peak caller does this. To integrate and analyze our data, we will develop a mini-encyclopedia with three levels of annotation, similar to the encyclopedia developed through the ENCODE project. The ground level will consist of the gene expression, TF binding, and histone modification data in worm and fly. The middle level will hold functional genomic regions, such as enhancers and HOT regions, identified by the statistical models we have developed from our preliminary results. The top level will contain linkages between genes and their regulators, as predicted by our models. The regulators include both cis- and trans-regulatory elements, such as enhancers and TFs. Moreover, these linkages will be integrated to form temporal or spatial networks, and we aim to identify key regulatory factors by comparing the structure of these networks. We will share all of our datasets, analysis results, and worm and fly strains with the community through the appropriate public databases.
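As an illustration of the network comparison described above, the following minimal sketch ranks regulators by how much their target sets differ between two networks, for example networks from two developmental stages. It is not the core's actual method: the Jaccard-distance score, the function names, and the TF and gene identifiers are all assumptions made for this example.

```python
def target_sets(edges):
    """Group (regulator, target) edges into a dict of regulator -> set of targets."""
    sets = {}
    for regulator, target in edges:
        sets.setdefault(regulator, set()).add(target)
    return sets


def rewiring_scores(edges_a, edges_b):
    """Rank regulators by Jaccard distance between their target sets in two networks.

    A score of 0.0 means identical targets in both networks; 1.0 means no
    targets in common. This scoring choice is an assumption for illustration.
    """
    a, b = target_sets(edges_a), target_sets(edges_b)
    scores = {}
    for regulator in set(a) | set(b):
        ta, tb = a.get(regulator, set()), b.get(regulator, set())
        # The union is never empty: each regulator has at least one edge.
        scores[regulator] = 1.0 - len(ta & tb) / len(ta | tb)
    # Most rewired first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical regulator-target linkages for two stages.
embryo = [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3"), ("TF3", "g4")]
larva = [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g5"), ("TF3", "g4"), ("TF3", "g6")]

for regulator, score in rewiring_scores(embryo, larva):
    print(f"{regulator}\t{score:.2f}")
```

In this toy input, the regulator whose targets change completely between the two stages ranks first, which is the kind of candidate a structural comparison would flag for follow-up.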