RNA-seq has become a routine assay in many biological and biomedical experiments for studying gene activity. The first step of RNA-seq analysis is usually to quantify the expression abundance of each transcript in the reference transcriptome. However, studies have shown that current reference transcriptomes are incomplete, which limits the accuracy of expression quantification. Now that large-scale RNA-seq data are available, an efficient and robust way to construct a transcriptome is to assemble the full-length expressed transcripts from a set of RNA-seq samples, a computational problem known as meta-assembly. This proposal addresses this problem and aims to develop efficient meta-assemblers for short-read and long-read RNA-seq data. In previous work, we developed Scallop (Nature Biotechnology, 2017; for short-read RNA-seq) and Scallop-LR (for long-read RNA-seq), so far the most accurate single-sample assemblers. The core of Scallop and Scallop-LR is a splice graph together with phasing paths, which encode reads spanning more than two vertices, to represent read alignments, and a novel algorithm that decomposes the splice graph while preserving all phasing paths. This data structure and the idea of "phase-preserving" decomposition provide the algorithmic foundation for our proposed meta-assembly algorithms. The key to meta-assembly is to take advantage of shared and complementary information across the given samples. We propose to combine multiple samples at the splice-graph level. Specifically, for each gene locus, we construct a single combined splice graph by merging the individual splice graphs and pooling their phasing paths. To retain the information in the individual splice graphs, their topologies will be encoded as additional phasing paths. The entire data structure is therefore space-efficient and lossless, and can be passed to the subsequent phase-preserving algorithms for decomposition.
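
The merging step described above can be illustrated with a minimal sketch. It is not the Scallop implementation: the `SpliceGraph` class, the `combine` function, and the exon coordinates are hypothetical, and the sketch assumes that vertices (exon intervals) already share coordinates across samples. How individual topologies are encoded as additional phasing paths is not specified here, so that step is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SpliceGraph:
    """Hypothetical, simplified splice graph for one gene locus."""
    # (u, v) -> coverage weight; u and v are exon intervals (start, end)
    edges: dict = field(default_factory=dict)
    # tuple of three or more vertices -> number of supporting reads
    phasing_paths: dict = field(default_factory=dict)


def combine(graphs):
    """Merge per-sample graphs by summing edge weights and pooling phasing paths."""
    edges = defaultdict(float)
    paths = defaultdict(float)
    for g in graphs:
        for edge, weight in g.edges.items():
            edges[edge] += weight
        for path, count in g.phasing_paths.items():
            paths[path] += count
    return SpliceGraph(dict(edges), dict(paths))


# Two toy samples over exons A, B, C, D (coordinates are made up).
A, B, C, D = (100, 200), (300, 400), (500, 600), (700, 800)
sample1 = SpliceGraph({(A, B): 10, (B, D): 10}, {(A, B, D): 6})
sample2 = SpliceGraph(
    {(A, B): 4, (B, C): 3, (C, D): 3, (B, D): 1},
    {(A, B, C): 3, (A, B, D): 1},
)
combined = combine([sample1, sample2])
```

In this toy case the combined graph carries the exon-skipping path A-B-D, supported by both samples, alongside the path A-B-C seen only in the second sample, which is the kind of shared and complementary evidence that merging at the splice-graph level is meant to capture.
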
We will specialize our existing phase-preserving algorithms to handle paired-end phasing paths and long phasing paths. Finally, statistical methods will be developed to infer the statistical significance of each assembled transcript, and multiple hypothesis testing will be performed to control the overall rate of falsely discovered transcripts. We also propose a new consensus approach that learns a discriminator to automatically select the optimal algorithm for each meta-assembly instance. The outcomes of this project will be open-source, easy-to-use, reproducible, and accurate meta-assemblers for short-read and long-read RNA-seq data, respectively. These meta-assemblers will enable more accurate identification of novel isoforms and annotation of gene structures. Combined with large-scale RNA-seq data, they will allow data-driven transcriptomes to be constructed, benefiting downstream analyses such as RNA-seq quantification and differential analysis.
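
As an illustration of the multiple-testing step, the sketch below applies the Benjamini-Hochberg procedure to per-transcript p-values. The choice of Benjamini-Hochberg is an assumption made for this example; the proposal does not name a specific procedure. The function name and the p-values are likewise hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the hypotheses, sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values for six assembled transcripts.
p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
keep = benjamini_hochberg(p, alpha=0.05)
# keep == [True, True, False, False, False, False]
```

Only the first two transcripts pass here, since 0.039 exceeds its rank threshold of 3/6 × 0.05 = 0.025 and no later rank satisfies the condition.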

Public Health Relevance

The new algorithms proposed here for assembling multiple RNA-seq samples will improve the identification of novel transcripts and the annotation of gene structures. Together with existing large-scale RNA-seq data in various repositories, our multiple-sample assembly methods can therefore be used to construct accurate data-driven transcriptomes, a prerequisite for downstream expression quantification and differential analysis. As RNA-seq has become a routine assay for measuring gene activity, such improvements will broadly advance biological and biomedical research.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Gilchrist, Daniel A
Pennsylvania State University
Biostatistics & Other Math Sci
Biomed Engr/Col Engr/Engr Sta
University Park
United States