The ENCODE projects have generated a wealth of high-quality genomic datasets by applying high-throughput next generation sequencing (NGS) to create a catalog of functional elements in the human and model organism genomes. Although the NGS technologies embraced by ENCODE enable interrogation of genomes in an unbiased manner, the data analysis efforts of the ENCODE projects have thus far focused on mappable regions of the genomes and thereby have not leveraged these data to their full advantage. A major bottleneck to a comprehensive understanding of data from the ENCODE projects is the lack of statistical and computational methods that can identify functional elements in repetitive regions. We will address this critical impediment in four specific aims by building on our expertise in ChIP-seq and RNA-seq analysis.
In Aim 1, we will develop probabilistic models and accompanying software for utilizing reads that map to multiple locations on the genome (multi-reads) from multiple types of *-seq datasets (ChIP-, DNase-, MeDIP-, and FAIRE-seq). This will enable cataloging of regulatory elements in repetitive regions.
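To make the multi-read idea concrete, the following is a minimal sketch of one common way to allocate multi-reads fractionally, using an EM-style iteration. It is an illustration under stated assumptions, not the project's model or software: the function name `allocate_multireads`, its parameters, and the toy region labels are all hypothetical, and each multi-read is assumed to list distinct candidate regions.

```python
def allocate_multireads(unique_counts, multireads, n_iter=50, pseudocount=1.0):
    """Fractionally assign multi-reads to candidate regions (illustrative only).

    unique_counts: dict mapping region -> number of uniquely mapped reads.
    multireads: list of lists; each inner list holds the distinct candidate
        regions one multi-read aligns to.
    Returns a dict mapping region -> expected read count.
    """
    regions = set(unique_counts)
    for candidates in multireads:
        regions.update(candidates)

    # Start from the unique-read evidence; the pseudocount (an assumption)
    # keeps regions with no unique reads from receiving zero weight.
    weights = {r: unique_counts.get(r, 0) + pseudocount for r in regions}

    for _ in range(n_iter):
        expected = {r: float(unique_counts.get(r, 0)) for r in regions}
        # E-step: split each multi-read in proportion to the current weights.
        for candidates in multireads:
            total = sum(weights[r] for r in candidates)
            for r in candidates:
                expected[r] += weights[r] / total
        # M-step: update the weights from the expected counts.
        weights = {r: expected[r] + pseudocount for r in regions}

    return expected


# Toy usage: region "A" has unique-read support, region "B" has none,
# and ten multi-reads align equally well to both.
counts = allocate_multireads({"A": 9, "B": 0}, [["A", "B"]] * 10)
```

In this toy case most of the multi-read mass goes to the region with unique-read support, while the total number of reads is preserved.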
In Aim 2, we will improve the specificity of the discoveries in repetitive regions from our probabilistic models by utilizing multiple related *-seq datasets simultaneously. Specifically, we will devise methods to supervise analysis of ChIP- and RNA-seq datasets by external ChIP-seq datasets. This will facilitate accurate inference for repetitive elements with near-identical sequences (e.g., segmental duplications and long interspersed nuclear elements) and boost the accuracy of gene and isoform quantification with RNA-seq.
In Aim 3, we will focus on identifying co-occupied/enriched regions to infer cell-specific modules of regions/genes and their regulatory profiles. We will also develop a formal differential co-enrichment framework to study cell-specific wiring and interactions of regulatory factors. This will elucidate how interactions among regulatory factors vary across cells/tissues/conditions.
In Aim 4, we will apply our methods from Aims 1-3 to relevant ENCODE data to understand GATA factor functions in hematopoiesis and vascular biology. The GATA system in human and mouse will serve as a training and validation platform for our methods. Statistical and computational resources generated from the project, which will be disseminated as modular and robust software, will help to enhance and maximize the impact of ENCODE-derived data on the biomedical research community.

Public Health Relevance

The ENCODE projects have generated a wealth of high-quality functional genomic datasets by applying high-throughput next generation sequencing (NGS) to create a catalog of functional elements in the human and model organism genomes. A central limitation to a comprehensive understanding of these ENCODE data from the standpoint of development, differentiation, and disease is the lack of statistical and computational methods that can identify functional elements in repetitive regions of the genomes. In this proposal, we will develop statistical and computational methods that leverage ENCODE-derived data to their full advantage and catalog functional repetitive regions of the genomes.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project--Cooperative Agreements (U01)
Project #
5U01HG007019-03
Application #
8687990
Study Section
Special Emphasis Panel (ZHG1)
Program Officer
Gilchrist, Daniel A
Project Start
2012-09-17
Project End
2015-06-30
Budget Start
2014-07-01
Budget End
2015-06-30
Support Year
3
Fiscal Year
2014
Total Cost
Indirect Cost
Name
University of Wisconsin Madison
Department
Biostatistics & Other Math Sci
Type
Schools of Medicine
DUNS #
City
Madison
State
WI
Country
United States
Zip Code
53715
Mehta, Charu; Johnson, Kirby D; Gao, Xin et al. (2017) Integrating Enhancer Mechanisms to Establish a Hierarchical Blood Development Program. Cell Rep 20:2966-2979
Welch, Rene; Chung, Dongjun; Grass, Jeffrey et al. (2017) Data exploration, quality control and statistical analysis of ChIP-exo/nexus experiments. Nucleic Acids Res 45:e145
Shin, Sunyoung; Keleş, Sündüz (2017) Annotation Regression for Genome-Wide Association Studies with an Application to Psychiatric Genomic Consortium Data. Stat Biosci 9:50-72
Bernstein, Matthew N; Doan, AnHai; Dewey, Colin N (2017) MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive. Bioinformatics 33:2914-2923
Kreimer, Anat; Zeng, Haoyang; Edwards, Matthew D et al. (2017) Predicting gene expression in massively parallel reporter assays: A comparative study. Hum Mutat 38:1240-1250
Zuo, Chandler; Chen, Kailei; Hewitt, Kyle J et al. (2016) A Hierarchical Framework for State-Space Matrix Inference and Clustering. Ann Appl Stat 10:1348-1372
Liu, Peng; Sanalkumar, Rajendran; Bresnick, Emery H et al. (2016) Integrative analysis with ChIP-seq advances the limits of transcript quantification from RNA-seq. Genome Res 26:1124-1133
Papale, Ligia A; Li, Sisi; Madrid, Andy et al. (2016) Sex-specific hippocampal 5-hydroxymethylcytosine is disrupted in response to acute stress. Neurobiol Dis 96:54-66
Zhang, Qi; Zeng, Xin; Younkin, Sam et al. (2016) Systematic evaluation of the impact of ChIP-seq read designs on genome coverage, peak identification, and allele-specific binding detection. BMC Bioinformatics 17:96
Tanimura, Nobuyuki; Miller, Eli; Igarashi, Kazuhiko et al. (2016) Mechanism governing heme synthesis reveals a GATA factor/heme circuit that controls differentiation. EMBO Rep 17:249-265

Showing the most recent 10 out of 22 publications