ChIP-seq and ChIP-chip, hereinafter referred to as ChIPx, are powerful technologies for mapping genome-wide protein-DNA interactions (PDIs). Microarrays, exon arrays and RNA-seq, on the other hand, are widely used to measure gene expression. Integrating ChIPx and gene expression data provides a powerful approach to studying gene regulation both during development and in disease. Traditionally, the ChIPx and gene expression experiments conducted by a single laboratory are used mainly to study one specific biological system. The collective efforts of many labs, however, have produced a large volume of data representing diverse biological systems. Jointly, these data contain an enormous amount of information that no individual lab has fully utilized. This proposal aims to develop a coordinated set of computational, statistical and software tools that allow scientists to synthesize the information in 3000+ publicly available ChIPx samples and 60,000+ gene expression profiles in human and mouse to make new discoveries. The project will turn these heterogeneous data into a tool for high-throughput discovery of the biological contexts (i.e., cell types, tissues and diseases) associated with gene regulatory pathway activities.

First, a statistical method named Gene Set Context Analysis (GSCA) will be developed. GSCA uses large amounts of public gene expression data to infer the biological contexts and diseases in which one or more gene sets (i.e., groups of genes) are coordinately activated or inactivated. Second, building on GSCA, a method called Transcription Factor Context Analysis (TFCA) will be developed. TFCA discovers novel functional contexts of transcription factors (TFs) and gene regulatory pathways. It first classifies the target genes of a TF into functional categories by integrating one's own ChIPx and gene expression data with public ChIPx and Gene Ontology data. It then uses GSCA to systematically discover the biological contexts (including diseases) associated with the function of each category.

Collectively, GSCA and TFCA will establish a new paradigm for analyzing ChIPx and gene expression data. The conventional approach analyzes data tied to a particular system. The new approach also leverages the rich information in public ChIPx and gene expression data to extend findings from one system to other biological systems. By allowing one to make novel discoveries beyond the scope of the original experiments and to connect gene regulatory pathways to diseases, the new approach will significantly increase the value of both new and existing data. Using GSCA and TFCA, 3000+ ChIPx samples and 60,000+ gene expression samples in human and mouse will be analyzed together to systematically map TF functions and ChIPx-defined regulatory pathway activities to diseases. Some of the new predictions will be validated experimentally. In addition to creating new knowledge about a variety of diseases, this research will provide urgently needed data integration and data mining tools that help scientists translate the rich information in publicly available ChIPx and gene expression data into new discoveries and identify promising new areas of biomedical research.
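
The abstract gives no algorithmic detail for GSCA, so the following is only a toy sketch of the general idea of asking in which annotated contexts a gene set is coordinately active. Everything in it is an assumption made for illustration: the function names, the mean z-score activity measure, the fixed cutoff, the hypergeometric enrichment test and the six-sample data set are hypothetical and should not be read as the project's actual method.

```python
from math import comb
from statistics import mean, pstdev


def gene_set_activity(expression, gene_set):
    """Mean z-score of the gene set's genes in each sample.

    expression maps a gene name to a list of values, one per sample.
    (Hypothetical activity measure chosen for this sketch.)
    """
    genes = [g for g in gene_set if g in expression]
    n_samples = len(next(iter(expression.values())))
    scores = [0.0] * n_samples
    for g in genes:
        values = expression[g]
        mu, sd = mean(values), pstdev(values)
        for i, v in enumerate(values):
            z = (v - mu) / sd if sd > 0 else 0.0
            scores[i] += z / len(genes)
    return scores


def hypergeom_upper_tail(k, n_total, n_context, n_active):
    """P(X >= k) when n_active samples are drawn from n_total samples,
    n_context of which belong to the context of interest."""
    upper = min(n_context, n_active)
    total = comb(n_total, n_active)
    hits = sum(
        comb(n_context, x) * comb(n_total - n_context, n_active - x)
        for x in range(k, upper + 1)
    )
    return hits / total


def enriched_contexts(expression, contexts, gene_set, cutoff=0.5):
    """For each context, return (active samples, context size, p-value)."""
    scores = gene_set_activity(expression, gene_set)
    active = [s > cutoff for s in scores]  # assumed fixed cutoff
    n_total, n_active = len(scores), sum(active)
    results = {}
    for ctx in sorted(set(contexts)):
        n_ctx = contexts.count(ctx)
        k = sum(1 for a, c in zip(active, contexts) if a and c == ctx)
        p = hypergeom_upper_tail(k, n_total, n_ctx, n_active)
        results[ctx] = (k, n_ctx, p)
    return results


# Made-up toy data: six samples annotated with two contexts.
toy_contexts = ["liver", "liver", "liver", "brain", "brain", "brain"]
toy_expression = {
    "geneA": [5, 5, 5, 1, 1, 1],
    "geneB": [4, 4, 4, 0, 0, 0],
    "geneC": [1, 2, 1, 2, 1, 2],
}
result = enriched_contexts(toy_expression, toy_contexts, {"geneA", "geneB"})
for ctx, (k, n_ctx, p) in result.items():
    print(f"{ctx}: {k}/{n_ctx} samples active, p = {p:.3f}")
```

In this toy example the gene set is active in all three liver samples and in none of the brain samples, so liver is the context flagged as enriched. A realistic analysis over tens of thousands of public profiles would also need normalization across platforms and correction for multiple testing, which this sketch leaves out.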

Public Health Relevance

Publicly available genomic data on gene expression and protein-DNA interactions contain an enormous amount of information that has not been fully utilized. This proposal develops computational, statistical and software tools to extract that information and applies them to systematically discover novel connections linking genes and biological pathways to diseases. The findings will increase our understanding of a variety of diseases and point to promising new areas of biomedical research.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006282-04
Application #: 8856618
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Gilchrist, Daniel A
Project Start: 2012-07-25
Project End: 2017-04-30
Budget Start: 2015-05-01
Budget End: 2016-04-30
Support Year: 4
Fiscal Year: 2015
Total Cost: $393,819
Indirect Cost: $118,418

Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21205

Zhang, Boyang; Hong, Xiumei; Ji, Hongkai et al. (2018) Maternal smoking during pregnancy and cord blood DNA methylation: new insight on sex differences and effect modification by maternal folate levels. Epigenetics 13:505-518
Kuang, Zheng; Ji, Zhicheng; Boeke, Jef D et al. (2018) Dynamic motif occupancy (DynaMO) analysis identifies transcription factors and their binding sites driving dynamic biological processes. Nucleic Acids Res 46:e2
Kuang, Zheng; Ji, Hongkai; Boeke, Jef D (2018) Stress response factors drive regrowth of quiescent cells. Curr Genet 64:807-810
Zhou, Weiqiang; Sherwood, Ben; Ji, Zhicheng et al. (2017) Genome-wide prediction of DNase I hypersensitivity using gene expression. Nat Commun 8:1038
Kuang, Zheng; Pinglay, Sudarshan; Ji, Hongkai et al. (2017) Msn2/4 regulate expression of glycolytic enzymes and control transition from quiescence to growth. Elife 6:
Ji, Zhicheng; Zhou, Weiqiang; Ji, Hongkai (2017) Single-cell regulome data analysis by SCRAT. Bioinformatics 33:2930-2932
Wang, Jie; Xia, Shuli; Arand, Brian et al. (2016) Single-Cell Co-expression Analysis Reveals Distinct Functional Modules, Co-regulation Mechanisms and Clinical Outcomes. PLoS Comput Biol 12:e1004892
Ji, Zhicheng; Ji, Hongkai (2016) TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res 44:e117
Ji, Zhicheng; Vokes, Steven A; Dang, Chi V et al. (2016) Turning publicly available gene expression data into discoveries using gene set context analysis. Nucleic Acids Res 44:e8
Zhou, Weiqiang; Sherwood, Ben; Ji, Hongkai (2016) Computational Prediction of the Global Functional Genomic Landscape: Applications, Methods, and Challenges. Hum Hered 81:88-105

Showing the most recent 10 out of 18 publications