RNA-Sequencing (RNA-Seq) analysis provides a critical means to understand gene functions. High-throughput RNA-Seq data are frequently measured under multiple conditions from the same set of samples. For example, in the NIH Common Fund's Genotype-Tissue Expression (GTEx) project, samples from different tissues are collected from each post-mortem donor for sequencing. In another study on ultraviolet (UV) radiation, skin keratinocytes from the same set of subjects are exposed to different radiation doses and durations before sequencing. Such common-sample, multi-condition RNA-Seq data have information shared across both samples and conditions, and have the potential to provide key insights into gene functions. However, despite great endeavors to collect such data, there is a lack of analytical methods and computational tools to maximize their potential. Important tasks such as missing data imputation, functional gene module identification and association analysis remain unaddressed. In this proposal, we will build an innovative and powerful paradigm to analyze multi-condition RNA-Seq data and thus improve our understanding of gene functions. To leverage information across conditions, samples and genes simultaneously, we propose to model RNA-Seq data as multi-way tensor arrays. We will develop novel tensor methods and theory that are appropriate for read count data. In particular, our first aim is to extend tensor completion methods for block-wise missing RNA-Seq data imputation. By modeling unobserved samples as missing blocks in a tensor, we will aggregate information along different modes (subjects, conditions, genes) to impute missing values.
The second aim develops flexible tensor co-clustering methods, which simultaneously cluster genes, samples and conditions, for co-expressed gene module identification.
The third aim is to build new tensor response regression models to associate gene modules with genotype and covariates, which will provide insights into genetic regulation such as expression quantitative trait loci (eQTL). Finally, in the fourth aim, we will develop scalable statistical software to implement the proposed methods and make them more broadly applicable. We will apply the methods to the GTEx multi-tissue data and UV multi-condition data, and gain novel insights into gene expression and regulation. The proposed research will likely transform how we analyze multi-condition RNA-Seq data and enhance our understanding of human genomics and its relation to public health.

Public Health Relevance

High-throughput RNA-Seq data collected under multiple conditions (e.g., tissues, experimental conditions, time points) from the same set of subjects provide an ideal resource for studying gene function and regulation. We propose to develop novel statistical methods and computational tools to maximize the utilization of these data and provide critical new insights into human genomics and its relation to public health.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Pillai, Ajay
Columbia University (N.Y.)
Biostatistics & Other Math Sci
Schools of Public Health
New York
United States