ChIP-seq and ChIP-chip, hereinafter referred to as ChIPx, are powerful technologies for mapping genome-wide protein-DNA interactions (PDIs). Microarrays, exon arrays and RNA-seq, on the other hand, are widely used to measure gene expression. Integrating ChIPx and gene expression data provides a powerful approach to studying gene regulation both during development and in disease. Traditionally, ChIPx and gene expression experiments conducted by a single laboratory are used mainly to study a specific biological system. The collective efforts of many labs have produced a large volume of data representing diverse biological systems. Jointly, these data contain an enormous amount of information that no individual lab has fully utilized. This proposal aims to develop a coordinated set of computational, statistical and software tools that allow scientists to synthesize information from 3000+ publicly available ChIPx samples and 60,000+ gene expression profiles in human and mouse to make new discoveries. The project will turn these heterogeneous data into a tool for high-throughput discovery of biological contexts (i.e., cell types, tissues and diseases) associated with gene regulatory pathway activities. First, a statistical method named Gene Set Context Analysis (GSCA) will be developed. GSCA uses large amounts of public gene expression data to infer the biological contexts, including diseases, in which one or more gene sets (i.e., groups of genes) are coordinately activated or inactivated. Second, building on GSCA, a method called Transcription Factor Context Analysis (TFCA) will be developed. TFCA discovers novel functional contexts of transcription factors (TFs) and gene regulatory pathways. It first classifies the target genes of a TF into functional categories by integrating one's own ChIPx and gene expression data with public ChIPx and Gene Ontology data. It then uses GSCA to systematically discover biological contexts (including diseases) associated with the function of each category. Collectively, GSCA and TFCA will establish a new paradigm for analyzing ChIPx and gene expression data. The conventional approach analyzes data tied to a particular system. The new approach also leverages the rich information in public ChIPx and gene expression data to extend findings from one system to other biological systems. By allowing one to make novel discoveries beyond the scope of the original experiments and to connect gene regulatory pathways to diseases, the new approach will significantly increase the value of both new and existing data. Using GSCA and TFCA, 3000+ ChIPx samples and 60,000+ gene expression samples in human and mouse will be analyzed together to systematically map TF functions and ChIPx-defined regulatory pathway activities to diseases. Some of the new predictions will be validated experimentally. In addition to creating new knowledge about a variety of diseases, this research will provide urgently needed data integration and data mining tools that help scientists translate the rich information in publicly available ChIPx and gene expression data into new discoveries and identify promising new areas of biomedical research.
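The core idea behind GSCA, as described above, can be illustrated with a small sketch. This is not the GSCA method or its software; it is a minimal illustration under stated assumptions. It assumes that a gene set's activity in a sample is summarized as the mean of per-gene z-scores, that a sample counts as "activated" when this activity exceeds an arbitrary cutoff, and that enrichment of activated samples within an annotated context is scored with a one-sided hypergeometric test. All function names, the cutoff, and the toy data are hypothetical.

```python
from math import comb
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores of one gene's expression across all samples."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def gene_set_activity(expression, gene_set):
    """Mean z-score per sample over the genes in gene_set.

    expression: dict mapping gene name -> list of per-sample values
    (an assumed input layout, chosen for illustration only).
    """
    z_rows = [standardize(expression[g]) for g in gene_set if g in expression]
    if not z_rows:
        raise ValueError("none of the genes in gene_set are in expression")
    n_samples = len(z_rows[0])
    return [mean(row[i] for row in z_rows) for i in range(n_samples)]


def hypergeom_upper_tail(k, n_total, n_hits, n_drawn):
    """P(X >= k) when drawing n_drawn of n_total items, n_hits of which are hits."""
    upper = min(n_hits, n_drawn)
    numerator = sum(
        comb(n_hits, i) * comb(n_total - n_hits, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(n_total, n_drawn)


def enriched_contexts(activity, contexts, cutoff=0.5):
    """Rank contexts by enrichment of samples whose activity exceeds cutoff.

    The cutoff is an arbitrary illustrative threshold, not a recommended value.
    Returns a list of (context, (n_active, n_samples, p_value)), smallest p first.
    """
    active = [a > cutoff for a in activity]
    n_total = len(activity)
    n_active = sum(active)
    results = {}
    for ctx in set(contexts):
        idx = [i for i, c in enumerate(contexts) if c == ctx]
        k = sum(active[i] for i in idx)
        p = hypergeom_upper_tail(k, n_total, n_active, len(idx))
        results[ctx] = (k, len(idx), p)
    return sorted(results.items(), key=lambda item: item[1][2])


# Toy data: six samples annotated with two hypothetical contexts.
toy_expression = {
    "geneA": [5, 5, 5, 1, 1, 1],
    "geneB": [6, 6, 6, 2, 2, 2],
}
toy_contexts = ["liver", "liver", "liver", "brain", "brain", "brain"]

toy_activity = gene_set_activity(toy_expression, ["geneA", "geneB"])
for context, (n_active, n_samples, p_value) in enriched_contexts(toy_activity, toy_contexts):
    print(context, n_active, n_samples, p_value)
```

In this toy example the gene set is active only in the samples labeled "liver", so that context ranks first. A real analysis over tens of thousands of public profiles would also need normalization across platforms, curated sample annotations, and correction for multiple testing, none of which this sketch attempts.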

Public Health Relevance

Publicly available genomic data on gene expression and protein-DNA interactions contain an enormous amount of information that has not been fully utilized. This proposal develops computational, statistical and software tools to extract that information and applies them to systematically discover novel connections linking genes and biological pathways to diseases. The findings will increase our understanding of a variety of diseases and point to promising new areas of biomedical research.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Pazin, Michael J
Johns Hopkins University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Ji, Zhicheng; Ji, Hongkai (2016) TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 44:e117
Wang, Jie; Xia, Shuli; Arand, Brian et al. (2016) Single-Cell Co-expression Analysis Reveals Distinct Functional Modules, Co-regulation Mechanisms and Clinical Outcomes. PLoS Comput Biol 12:e1004892
Li, Qiang; Lex, Rachel K; Chung, HaeWon et al. (2016) The Pluripotency Factor NANOG Binds to GLI Proteins and Represses Hedgehog-mediated Transcription. J Biol Chem 291:7171-82
Ji, Zhicheng; Vokes, Steven A; Dang, Chi V et al. (2016) Turning publicly available gene expression data into discoveries using gene set context analysis. Nucleic Acids Res 44:e8
Jin, Kideok; Park, Sunju; Teo, Wei Wen et al. (2015) HOXB7 Is an ERα Cofactor in the Activation of HER2 and Multiple ER Target Genes Leading to Endocrine Resistance. Cancer Discov 5:944-59
Wei, Yingying; Tenzen, Toyoaki; Ji, Hongkai (2015) Joint analysis of differential gene expression in multiple studies using correlation motifs. Biostatistics 16:31-46
Wu, Hao; Ji, Hongkai (2014) PolyaPeak: detecting transcription factor binding sites from ChIP-seq using peak shape information. PLoS One 9:e89694
Newman, Robert H; Hu, Jianfei; Rho, Hee-Sool et al. (2013) Construction of human activity-based phosphorylation networks. Mol Syst Biol 9:655
Wu, George; Yustein, Jason T; McCall, Matthew N et al. (2013) ChIP-PED enhances the analysis of ChIP-seq and ChIP-chip data. Bioinformatics 29:1182-9
Wang, Jiayi; Park, Joo-Seop; Wei, Yingying et al. (2013) TRIB2 acts downstream of Wnt/TCF in liver cancer cells to regulate YAP and C/EBPα function. Mol Cell 51:211-25
