ChIP-seq and ChIP-chip, hereinafter referred to as ChIPx, are powerful technologies for mapping genome-wide protein-DNA interactions (PDIs). Microarrays, exon arrays and RNA-seq, on the other hand, are widely used to measure gene expression. Integrating ChIPx and gene expression data provides a powerful approach to studying gene regulation both during development and in disease. Traditionally, ChIPx and gene expression experiments conducted by a single laboratory are mainly used to study a specific biological system. The collective efforts of many labs have produced a large volume of data representing diverse biological systems. Jointly, these data contain an enormous amount of information that no individual lab has fully utilized.

This proposal aims to develop a coordinated set of computational, statistical and software tools that allow scientists to synthesize information from 3000+ publicly available ChIPx samples and 60,000+ gene expression profiles in human and mouse to make new discoveries. The project will turn these heterogeneous data into a tool for high-throughput discovery of biological contexts (i.e., cell types, tissues and diseases) associated with gene regulatory pathway activities.

First, a statistical method named Gene Set Context Analysis (GSCA) will be developed. GSCA uses large amounts of public gene expression data to infer the biological contexts and diseases in which one or more gene sets (i.e., groups of genes) are coordinately activated or inactivated. Second, building on GSCA, a method called Transcription Factor Context Analysis (TFCA) will be developed. TFCA discovers novel functional contexts of transcription factors (TFs) and gene regulatory pathways. This method first classifies the target genes of a TF into different functional categories by integrating one's own ChIPx and gene expression data with public ChIPx and Gene Ontology data. It then uses GSCA to systematically discover biological contexts (including diseases) associated with the function of each category.

Collectively, GSCA and TFCA will establish a new paradigm for analyzing ChIPx and gene expression data. The conventional approach analyzes data tied to a particular system. In the new approach, one also leverages the rich information in public ChIPx and gene expression data to extend findings in one system to other biological systems. By allowing one to make novel discoveries beyond the scope of the original experiments and to connect gene regulatory pathways to diseases, the new approach will significantly increase the value of both new and existing data. Applying GSCA and TFCA, 3000+ ChIPx samples and 60,000+ gene expression samples in human and mouse will be analyzed together to systematically map TF functions and ChIPx-defined regulatory pathway activities to diseases. Some new predictions will be validated experimentally. In addition to creating new knowledge about a variety of diseases, this research will provide urgently needed data integration and data mining tools to help scientists translate the rich information in publicly available ChIPx and gene expression data into new discoveries and identify promising new areas of biomedical research.
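The core idea behind GSCA, as described above, is to look across many annotated public expression profiles for contexts in which given gene sets are coordinately active. The abstract does not specify the statistical procedure, so the following Python sketch is only an illustration under stated assumptions, not the proposed method: a gene set's activity is taken to be the mean z-score of its genes, "active" is an arbitrary cutoff, and enrichment of a context among the active samples is scored with a hypergeometric upper tail. All function names, the cutoff and the toy data are hypothetical.

```python
from math import comb
from statistics import mean, pstdev


def standardize(expr):
    """Z-score each gene across samples. expr maps gene -> list of values, one per sample."""
    out = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        out[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return out


def active_samples(z, gene_sets, cutoff):
    """Indices of samples in which every gene set has mean z-score above cutoff."""
    n_samples = len(next(iter(z.values())))
    return [
        i for i in range(n_samples)
        if all(mean(z[g][i] for g in genes) > cutoff for genes in gene_sets)
    ]


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n of N samples are drawn and K of the N carry the context."""
    total = comb(N, n)
    return sum(comb(K, x) * comb(N - K, n - x) for x in range(k, min(K, n) + 1)) / total


def context_enrichment(expr, contexts, gene_sets, cutoff=1.0):
    """For each context, return (active samples in context, context size, p-value)."""
    hits = active_samples(standardize(expr), gene_sets, cutoff)
    N, n = len(contexts), len(hits)
    results = {}
    for ctx in sorted(set(contexts)):
        K = contexts.count(ctx)
        k = sum(1 for i in hits if contexts[i] == ctx)
        results[ctx] = (k, K, hypergeom_upper_tail(k, K, n, N))
    return results


# Toy data (made up for illustration): six samples, two gene sets.
expr = {
    "A": [5, 5, 5, 1, 1, 1],
    "B": [5, 5, 5, 1, 1, 1],
    "C": [4, 4, 4, 0, 0, 0],
}
contexts = ["liver", "liver", "liver", "brain", "brain", "brain"]
gene_sets = [["A", "B"], ["C"]]  # both sets must be active in a sample

result = context_enrichment(expr, contexts, gene_sets, cutoff=0.5)
for ctx, (k, K, p) in result.items():
    print(ctx, k, K, p)
```

In this toy example both gene sets are high only in the liver samples, so liver receives a small tail probability and brain does not. A real analysis over tens of thousands of profiles would also need platform normalization and multiple-testing correction, which this sketch omits.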

Public Health Relevance

Publicly available genomic data on gene expression and protein-DNA interactions contain an enormous amount of information that has not been fully utilized. This proposal develops computational, statistical and software tools to extract that information and applies these tools to systematically discover novel connections linking genes and biological pathways to diseases. The findings will increase our understanding of a variety of diseases and point to promising new areas of biomedical research.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG006282-01A1
Application #: 8372529
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Pazin, Michael J
Project Start: 2012-07-25
Project End: 2017-04-30
Budget Start: 2012-07-25
Budget End: 2013-04-30
Support Year: 1
Fiscal Year: 2012
Total Cost: $419,462
Indirect Cost: $138,320
Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21218
Wu, Hao; Ji, Hongkai (2014) PolyaPeak: detecting transcription factor binding sites from ChIP-seq using peak shape information. PLoS One 9:e89694
Wang, Jiayi; Park, Joo-Seop; Wei, Yingying et al. (2013) TRIB2 acts downstream of Wnt/TCF in liver cancer cells to regulate YAP and C/EBPα function. Mol Cell 51:211-25
Wu, George; Yustein, Jason T; McCall, Matthew N et al. (2013) ChIP-PED enhances the analysis of ChIP-seq and ChIP-chip data. Bioinformatics 29:1182-9
Newman, Robert H; Hu, Jianfei; Rho, Hee-Sool et al. (2013) Construction of human activity-based phosphorylation networks. Mol Syst Biol 9:655
Wu, George; Ji, Hongkai (2013) ChIPXpress: using publicly available gene expression data to improve ChIP-seq and ChIP-chip target gene ranking. BMC Bioinformatics 14:188