A comprehensive understanding of how genes' activities are controlled temporally and spatially is crucial for studying human development and diseases. Transcription factors (TFs) are an important class of regulatory proteins that control genes' transcriptional activities by binding to regulatory DNA sequences of their target genes, called cis-regulatory elements (CREs). A map of the genome-wide activities of CREs, or "regulome", in all cell types and biological conditions will provide a foundation for investigating the basic operating rules of biology, interpreting how genetic variants cause diseases, and guiding the development of disease treatment strategies. Unfortunately, existing experimental regulome mapping technologies cannot analyze a large number of samples efficiently. Thus far, they have been applied to map regulomes in only a small fraction of all biological contexts. As a result, a comprehensive map of the human regulatory landscape is still lacking. This study aims to develop a solution for mapping regulomes in a massive number of biological samples from diverse cell types and conditions by leveraging publicly available functional genomic data. We will use the rich gene expression and regulome data generated by the Encyclopedia of DNA Elements (ENCODE) project to develop a new approach that predicts a biological sample's regulome from its transcriptome (Aim 1). We will then apply the trained prediction models to 290,000+ publicly available human gene expression samples in the Gene Expression Omnibus (GEO) database to create a regulome map that covers hundreds of thousands more biological contexts than existing regulome data (Aim 2). We will also develop a method that helps researchers explore these massive datasets and gain biological insights into gene regulation by projecting the data onto a low-dimensional structure that reflects their developmental trajectory (Aim 3).
Our research will create new analytical methods for predicting ultra-high-dimensional outcomes using ultra-high-dimensional predictors, making cross-platform predictions when the training and application data are generated by different technological platforms with systematic platform differences, and retrieving the low-dimensional spanning tree structure from a massive dataset. Applying these new methods to the vast amounts of publicly available gene expression data will allow us to address a major challenge in regulome mapping that cannot be solved using existing experimental technologies. By enabling fast and cost-efficient mapping and analysis of the human gene regulatory landscape, the proposed research can have a major impact on future studies of human development and diseases.

Public Health Relevance

Understanding how genes' activities are controlled temporally and spatially is crucial for studying human development and diseases. This proposal develops statistical and computational tools that leverage massive amounts of publicly available genomic data to comprehensively map and analyze the human gene regulatory landscape. By providing a big-data-driven solution for analyzing gene regulation in a vast number of normal and disease cell types, the technologies developed in this proposal are expected to have a major impact on future studies of human diseases.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG009518-02
Application #
9762143
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Gilchrist, Daniel A
Project Start
2018-08-10
Project End
2022-05-31
Budget Start
2019-06-01
Budget End
2020-05-31
Support Year
2
Fiscal Year
2019
Total Cost
Indirect Cost
Name
Johns Hopkins University
Department
Biostatistics & Other Math Sci
Type
Schools of Public Health
DUNS #
001910777
City
Baltimore
State
MD
Country
United States
Zip Code
21205