Understanding both normal and pathogenic patterns of human gene expression can help shed light on the biology of human disease. Thousands of studies have now measured gene expression in different tissues and diseases. By aggregating and analyzing all available human RNA-sequencing data using a high-powered computational and statistical framework, we will provide a transformative resource for characterizing human gene expression patterns, including rare transcriptional events, cellular networks, and genetic variation.
In Aim 1, we propose to uniformly process all publicly available human transcriptome sequencing data and collect it into a publicly available resource called the Transcriptome Aggregation Resource (TAR); at least 150,000 samples will be processed using cloud computing. This resource will contain single-base resolution maps of expression, de novo mapped exon-exon splice junctions, and allele-specific expression across a set of common variations. We will supplement the expression data with cleaned and predicted metadata.
In Aim 2, we will develop the statistical and computational methods necessary to fully realize the potential of this resource. Specifically, we will remove unwanted variation at scale and develop mixture models to summarize the large data resource at the gene, junction, and single-base levels.
In Aim 3, we will analyze this resource to address fundamental questions in expression biology, including a systematic study of expression outliers and allele-specific expression at gene, junction, and single-base resolution. We will infer well-powered co-expression networks over both expressed genes and splicing patterns. This work will contribute significantly to our understanding of gene expression by analyzing genomics data at a massive scale.

Public Health Relevance

Understanding both normal and pathogenic patterns of human gene expression can help shed light on the biology of human disease. Thousands of studies have now measured gene expression in different tissues and diseases. By aggregating and analyzing all available human RNA-sequencing data using a high-powered computational and statistical framework, we will provide a transformative resource for characterizing human gene expression patterns, including rare transcriptional events, cellular networks, and genetic variation.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM121459-03
Application #: 9748565
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Krasnewich, Donna M
Project Start: 2017-08-01
Project End: 2022-07-31
Budget Start: 2019-08-01
Budget End: 2020-07-31
Support Year: 3
Fiscal Year: 2019
Total Cost:
Indirect Cost:
Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 001910777
City: Baltimore
State: MD
Country: United States
Zip Code: 21205