RNA-seq is a powerful tool for studying molecular biology. However, without cell sorting (or related techniques), conventional RNA-seq applied to tissue samples cannot determine gene expression in the underlying cell-types. This is problematic because differential gene expression observed at the tissue level is not necessarily reflected in the underlying cell-types, which obscures biological insight. For example, Schmiedel et al. recently applied RNA-seq to 13 purified blood cell-types from 106 individuals [1], uncovering the molecular basis of sex-specific differences in immune response. These differences were obscured when RNA-seq was applied to whole blood alone. Single-cell RNA-seq is the obvious candidate to probe cell-type-specific effects more broadly. However, for most tissues, single-cell RNA-seq has been restricted to small sample sizes because of specialized dissociation protocols and cost. Thus, only bulk-tissue RNA-seq data are available for large sample sizes. Crucially, many of these bulk datasets are paired with enormous stores of informative clinical phenotypic data and additional -omics data. They include large NIH initiatives such as GTEx, TCGA, and All of Us, which have collected data on genetics, disease status, outcome, drug treatments, ethnicity, sex, and much more. The critical gap is that we cannot currently study the relationship between cell-type-level gene expression and any of these phenotypes. To overcome this limitation, we will develop computational tools for estimating cell-type-specific differential expression from bulk RNA-seq data when a small reference single-cell RNA-seq dataset is available from the same tissue type. This will allow us to study the cell-type-specific differences in expression that drive human phenotypes and diseases, unlocking the tens of thousands of bulk RNA-seq samples paired with phenotypic data.
The basis for this research program is a previous study in which we developed a method to recover the cell-type-specific effects of inherited genetic variation on gene expression in bulk breast-tumor RNA-seq data. This method allowed us to discover a novel breast cancer risk gene, one that was obscured when conventional methods were used. Here, we posit that a similar mathematical framework can be adapted to recover any cell-type-specific effect from bulk-tissue RNA-seq. Hence, by leveraging matched single-cell data, we can develop specific tools to perform multiple commonly applied analyses at cell-type-specific resolution from bulk-tissue RNA-seq, including differential expression, correlative, and gene set enrichment analyses. Finally, new spatial transcriptomics technologies are emerging that enable spatially resolved gene expression to be measured directly in tissue sections. These platforms quantify gene expression in situ in ~100 µm barcoded spots. Each spot captures a small cluster of cells, akin to a miniaturized bulk-tissue RNA-seq experiment. Hence, the same abstract mathematical framework can be used to identify effects such as cell-type-specific spatial variation in gene expression. Computational tools for these data are evolving quickly; thus, this award will also allow us to develop methods that meet the changing needs of these new gene expression platforms.
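As a rough illustration of the kind of framework described above, the sketch below fits a simple interaction model in which bulk expression of one gene is a proportion-weighted sum of cell-type-specific baselines and phenotype effects. This is an assumption made for illustration only, not the method proposed in this application: the model form, the function name `fit_cell_type_effects`, and the simulated data are all hypothetical, and the cell-type proportions are assumed to have been estimated beforehand (for example, from a reference single-cell dataset).

```python
import numpy as np

def fit_cell_type_effects(y, props, x):
    """Least-squares fit of the assumed model
        y_i = sum_k props[i, k] * (alpha_k + beta_k * x_i) + noise,
    where y is bulk expression of one gene, props holds per-sample
    cell-type proportions, and x is a phenotype (hypothetical setup)."""
    # Columns: proportions (baselines), then proportion-by-phenotype interactions.
    design = np.hstack([props, props * x[:, None]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    k = props.shape[1]
    return coef[:k], coef[k:]  # alpha_hat (baselines), beta_hat (effects)

# Simulated example: only cell-type 1 carries a phenotype effect.
rng = np.random.default_rng(0)
n_samples, n_types = 500, 3
props = rng.dirichlet(np.ones(n_types), size=n_samples)   # rows sum to 1
x = rng.integers(0, 2, size=n_samples).astype(float)      # binary phenotype
true_alpha = np.array([5.0, 2.0, 8.0])
true_beta = np.array([0.0, 3.0, 0.0])
per_type_expr = true_alpha + np.outer(x, true_beta)       # shape (n_samples, n_types)
y = (props * per_type_expr).sum(axis=1) + rng.normal(0.0, 0.1, n_samples)

alpha_hat, beta_hat = fit_cell_type_effects(y, props, x)
print("estimated cell-type-specific effects:", np.round(beta_hat, 2))
```

In this toy setting, the phenotype effect is present in one cell-type only, so a tissue-level comparison would see it diluted by that cell-type's proportion, whereas the interaction terms attribute it to the correct cell-type. A real analysis would also need to handle count noise, uncertainty in the proportion estimates, and covariates.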

Public Health Relevance

Bulk-tissue RNA-seq datasets are now available for tens of thousands of samples, often coupled with potentially highly informative phenotypic data and other -omics data, such as those collected in electronic medical records or by NIH initiatives such as the Genotype-Tissue Expression (GTEx) project. However, current computational tools are insufficient for interrogating these bulk-tissue RNA-seq data at the level of the constituent cell-types, a limitation that has been shown to obscure biological mechanisms and insights. We propose developing computational tools that will allow us to interrogate large bulk-tissue RNA-seq datasets at the cell-type level by leveraging small reference single-cell RNA-seq datasets generated from the same tissue type. We will also expand these computational methodologies for the analysis of emerging spatial transcriptomics platforms.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Unknown (R35)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Krasnewich, Donna M
St. Jude Children's Research Hospital
United States