RNA-Seq has revolutionized transcriptomics and is one of the most important high-throughput sequencing assays invented in recent years. The key computational problem is that of de novo assembly: the reconstruction of the transcripts and their abundances from tens to hundreds of millions of short reads. The problem is challenging due to a confluence of several factors: the large number of different transcripts (tens of thousands), long repeats across transcripts due to alternative splicing, widely varying abundances across transcripts, and the presence of read errors. Existing assemblers are mostly designed based on heuristic considerations and implement ad hoc methods that lead to unreliable transcriptome reconstructions. An accurate RNA-Seq assembler would enable more accurate identification of fusions in cancer transcriptomes, better gene annotations in model and non-model organisms, and more complete analyses of the dynamics of alternative splicing driving developmental and regulatory programs. In this proposal, we offer a systematic approach to the design of RNA-Seq assemblers based on information-theoretic principles. We start by determining conditions on the data that guarantee that there is enough information to reconstruct the transcriptome, and then propose an assembly algorithm that can reconstruct it with this minimal information. This algorithm optimally uses the available read information to resolve repeats and disambiguate isoforms. A key insight derived from the information-theoretic approach is that widely varying abundances across transcripts, rather than being a complication, can actually be exploited as signatures of different transcripts to disambiguate among them. Based on our initial ideas, we have built an initial prototype and evaluated it against several existing software packages on both real and simulated data. The encouraging results provide evidence that our approach, which we will fully develop, implement, and evaluate during the funded period, can significantly outperform existing software. Additional functionalities such as mixed short/long read assembly, genome-assisted assembly, and joint processing of multiple RNA samples will be designed and incorporated into the software as part of the proposed project.
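The idea that abundances can act as transcript signatures can be illustrated with a toy sketch. This is not the proposed algorithm, only a minimal illustration under strong assumptions: two transcripts share one repeated segment, each transcript enters and leaves that segment through exactly one flanking segment, and read coverage is a noisy proxy for abundance. The function name `pair_by_abundance`, the segment names, and the coverage values are all hypothetical.

```python
from itertools import permutations


def pair_by_abundance(in_cov, out_cov):
    """Pair segments entering and leaving a shared repeat by coverage similarity.

    in_cov, out_cov: dicts mapping a (hypothetical) segment name to its
    average read coverage. Assumes the same number of entries on both sides.
    Returns the pairing that minimizes the total absolute coverage mismatch.
    """
    ins = list(in_cov)
    outs = list(out_cov)
    if len(ins) != len(outs):
        raise ValueError("toy sketch assumes equal numbers of in/out segments")

    best_pairs, best_cost = None, float("inf")
    # Brute force over all pairings; fine for the handful of segments in a toy case.
    for perm in permutations(outs):
        cost = sum(abs(in_cov[i] - out_cov[o]) for i, o in zip(ins, perm))
        if cost < best_cost:
            best_pairs, best_cost = list(zip(ins, perm)), cost
    return best_pairs


# Made-up coverages: one highly expressed and one lowly expressed transcript.
entering = {"exonA": 100.0, "exonB": 9.0}
leaving = {"exonC": 11.0, "exonD": 98.0}
pairs = pair_by_abundance(entering, leaving)
print(pairs)
```

With these made-up numbers, the sequence of the shared segment alone cannot tell which entering segment belongs with which leaving segment, but the large gap in coverage makes the pairing unambiguous. If the two transcripts had nearly equal abundances, this signal would vanish.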

Public Health Relevance

The problem of transcriptome assembly is fundamental for clinical applications of RNA-Seq technology, especially for diagnostic applications requiring the detection of aberrant transcripts. We propose a novel, rigorous approach to assembly, based on the principles of information theory, that leverages the widely varying abundances across transcripts to resolve and assemble complex gene structures. We provide preliminary results comparing a prototype implementation against several existing software packages to validate the approach, and propose to build and evaluate a complete, scalable assembler that will be both fast and (provably) accurate in the high-throughput assembly of RNA-Seq transcriptome data.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Gilchrist, Daniel A
University of California Berkeley
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Seigal, Anna (2018) Gram Determinants of Real Binary Tensors. Linear Algebra Appl 544:350-369
Zhang, Jesse M; Fan, Jue; Fan, H Christina et al. (2018) An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinformatics 19:93
Ntranos, Vasilis; Kamath, Govinda M; Zhang, Jesse M et al. (2016) Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol 17:112
Pimentel, Harold; Sturmfels, Pascal; Bray, Nicolas et al. (2016) The Lair: a resource for exploratory analysis of published RNA-Seq data. BMC Bioinformatics 17:490