Algorithms and Software for Provably Accurate De Novo RNA-Seq Assembly

Kannan, Sreeram; Pachter, Lior; Tse, David

Abstract

RNA-Seq has revolutionized transcriptomics and is one of the most important high-throughput sequencing assays invented in recent years. The key computational problem is that of de novo assembly: the reconstruction of the transcripts and their abundances from tens to hundreds of millions of short reads. The problem is challenging due to a confluence of several factors: a large number of different transcripts (tens of thousands), long repeats across transcripts due to alternative splicing, widely varying abundances across transcripts, and the presence of read errors. Existing assemblers are mostly designed based on heuristic considerations and implement ad hoc methods that lead to unreliable transcriptome reconstructions. An accurate RNA-Seq assembler would enable more accurate identification of fusions in cancer transcriptomes, better gene annotations in model and non-model organisms, and more complete analyses of the dynamics of alternative splicing driving developmental and regulatory programs. In this proposal, we offer a systematic approach to the design of RNA-Seq assemblers based on information-theoretic principles. We start by determining conditions on the data that guarantee that there is enough information to reconstruct the transcriptome, and then propose an assembly algorithm that can reconstruct it with the minimal information. This algorithm optimally uses the available read information to resolve repeats and disambiguate isoforms. A key insight derived from the information-theoretic approach is that widely varying abundances across transcripts, rather than being a complication, can actually be exploited as signatures of different transcripts to disambiguate among them. Based on our initial ideas, we have built and evaluated an initial prototype and compared it with several existing software packages on both real and simulated data.
The encouraging results provide evidence that our approach, which we will fully develop, implement and evaluate during the funded period, can significantly outperform existing software. Additional functionalities such as mixed short/long read assembly, genome-assisted assembly and joint processing of multiple RNA samples will be designed and incorporated into the software as part of the proposed project.
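
The abundance-as-signature idea in the abstract can be illustrated with a toy sketch. This is not the proposal's algorithm; it is a simple greedy flow decomposition at a single shared exon, written under several assumptions: the exon names (A1, A2, B, C1, C2), the coverage values, and the function name `decompose_through_shared_exon` are all hypothetical, and coverage is assumed to be error-free and balanced. Two upstream exons and two downstream exons flank a shared exon B that is too long for any read to span, so reads alone cannot say which upstream exon pairs with which downstream exon. Differing abundances resolve the pairing.

```python
def decompose_through_shared_exon(in_cov, out_cov, shared="B", tol=1e-9):
    """Greedily pair upstream and downstream exons by coverage.

    in_cov / out_cov map exon names to (hypothetical) read coverage.
    Returns a list of ((upstream, shared, downstream), abundance) tuples.
    """
    # Everything that enters the shared exon must also leave it.
    if abs(sum(in_cov.values()) - sum(out_cov.values())) > tol:
        raise ValueError("coverage entering and leaving the shared exon must balance")

    ins = dict(in_cov)    # remaining unexplained upstream coverage
    outs = dict(out_cov)  # remaining unexplained downstream coverage
    transcripts = []

    while ins and outs:
        # Take the most abundant remaining exon on each side.
        a = max(ins, key=ins.get)
        c = max(outs, key=outs.get)
        flow = min(ins[a], outs[c])
        if flow > tol:
            transcripts.append(((a, shared, c), flow))
        ins[a] -= flow
        outs[c] -= flow
        # Drop exons whose coverage is fully explained.
        if ins[a] <= tol:
            del ins[a]
        if outs[c] <= tol:
            del outs[c]

    return transcripts


# Hypothetical coverages: one isoform is abundant, the other is rare.
result = decompose_through_shared_exon(
    {"A1": 10.0, "A2": 3.0},
    {"C1": 10.0, "C2": 3.0},
)
for path, abundance in result:
    print("-".join(path), abundance)
```

With the coverages above, the sketch pairs A1 with C1 and A2 with C2. If all four flanking exons had equal coverage, the same data would be explained equally well by either pairing, which is the sense in which varying abundances act as signatures.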

Public Health Relevance

The problem of transcriptome assembly is fundamental for clinical applications of RNA-Seq technology, especially for diagnostic applications requiring the detection of aberrant transcripts. We propose a novel approach to assembly, based on the principles of information theory, that rigorously leverages the widely varying abundances across transcripts to resolve and assemble complex gene structures. We provide preliminary results comparing a prototype implementation against several existing software packages to validate the approach, and propose to build and evaluate a complete scalable assembler that will be both fast and (provably) accurate in the high-throughput assembly of RNA-Seq transcriptome data.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG008164-03
Application #: 9284513
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Gilchrist, Daniel A

Project Start: 2015-09-16
Project End: 2017-10-31
Budget Start: 2017-07-01
Budget End: 2017-10-31
Support Year: 3
Fiscal Year: 2017
Total Cost
Indirect Cost

Institution

Name: University of California Berkeley
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 124726725

City: Berkeley
State: CA
Country: United States
Zip Code: 94704

Related projects


NIH 2017 R01 HG	Algorithms and Software for Provably Accurate De Novo RNA-Seq Assembly Kannan, Sreeram; Pachter, Lior S.; Tse, David / University of California Berkeley
NIH 2017 R01 HG	Algorithms and Software for Provably Accurate De Novo RNA-Seq Assembly Kannan, Sreeram; Pachter, Lior S.; Tse, David / California Institute of Technology
NIH 2016 R01 HG	Algorithms and Software for Provably Accurate De Novo RNA-Seq Assembly Kannan, Sreeram; Pachter, Lior S.; Tse, David / University of California Berkeley
NIH 2015 R01 HG	Algorithms and Software for Provably Accurate De Novo RNA-Seq Assembly Pachter, Lior S.; Kannan, Sreeram; Tse, David / University of California Berkeley	$476,329

Publications

Seigal, Anna (2018) Gram Determinants of Real Binary Tensors. Linear Algebra Appl 544:350-369

Zhang, Jesse M; Fan, Jue; Fan, H Christina et al. (2018) An interpretable framework for clustering single-cell RNA-Seq datasets. BMC Bioinformatics 19:93

Ntranos, Vasilis; Kamath, Govinda M; Zhang, Jesse M et al. (2016) Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol 17:112

Pimentel, Harold; Sturmfels, Pascal; Bray, Nicolas et al. (2016) The Lair: a resource for exploratory analysis of published RNA-Seq data. BMC Bioinformatics 17:490
