The ENCODE projects have generated a wealth of high-quality genomic datasets by applying high-throughput next-generation sequencing (NGS) to create a catalog of functional elements in the human and model organism genomes. Although the NGS technologies embraced by ENCODE enable interrogation of genomes in an unbiased manner, the data analysis efforts of the ENCODE projects have thus far focused on mappable regions of the genomes and thereby have not leveraged these data to their full advantage. A major bottleneck to a comprehensive understanding of data from the ENCODE projects is the lack of statistical and computational methods that can identify functional elements in repetitive regions. We will address this critical impediment in four specific aims by building on our expertise in ChIP-seq and RNA-seq analysis.
In Aim 1, we will develop probabilistic models and accompanying software for utilizing reads that map to multiple locations on the genome (multi-reads) from multiple types of *-seq datasets (ChIP-, DNase-, MeDIP-, and FAIRE-seq). This will enable cataloging of regulatory elements in repetitive regions.
In Aim 2, we will improve the specificity of the discoveries in repetitive regions from our probabilistic models by utilizing multiple related *-seq datasets simultaneously. Specifically, we will devise methods to supervise analysis of ChIP- and RNA-seq datasets by external ChIP-seq datasets. This will facilitate accurate inference for repetitive elements with near-identical sequences (e.g., segmental duplications and long interspersed nuclear elements) and boost the accuracy of gene and isoform quantification with RNA-seq.
In Aim 3, we will focus on identifying co-occupied/enriched regions to infer cell-specific modules of regions/genes and their regulatory profiles. We will also develop a formal differential co-enrichment framework to study cell-specific wiring and interactions of regulatory factors. This will elucidate how interactions among regulatory factors vary across cells/tissues/conditions.
In Aim 4, we will apply our methods from Aims 1-3 to relevant ENCODE data to understand GATA factor functions in hematopoiesis and vascular biology. The GATA system in human and mouse will serve as a training and validation platform for our methods. Statistical and computational resources generated from the project, which will be disseminated as modular and robust software, will help to enhance and maximize the impact of ENCODE-derived data on the biomedical research community.
The ENCODE projects have generated a wealth of high-quality functional genomic datasets by applying high-throughput next-generation sequencing (NGS) to create a catalog of functional elements in the human and model organism genomes. A central limitation to a comprehensive understanding of these ENCODE data from the standpoint of development, differentiation, and disease is the lack of statistical and computational methods that can identify functional elements in repetitive regions of the genomes. In this proposal, we will develop statistical and computational methods that leverage ENCODE-derived data to their full advantage and catalog functional repetitive regions of the genomes.