The rapid development of third-generation, long-read sequencing (LRS) platforms such as PacBio and Oxford Nanopore Technologies (ONT) has enabled increasingly precise and higher-throughput sequencing of transcripts. Long reads can capture full-length transcript sequences, removing much of the uncertainty that short-read methods face when defining transcripts, particularly for genes with alternative splicing (more than 90% of human genes), which short-read sequencing has thus far struggled to resolve. LRS is therefore the natural choice for studying the expression of transcript variants and the role of alternative isoforms in disease and development. While the first iterations of long-read technologies did not produce enough reads to quantify more than the most highly expressed transcripts, the current sequencing depth of up to 8 million reads per SMRT cell on the Sequel II platform promises reliable quantification of more modestly expressed genes. Significant yield increases have also been reported for Nanopore. This suggests that LRS may have reached sufficient throughput to enable accurate quantification of gene expression and differential expression analysis. LRS transcriptomics data, however, have specific properties that are absent in other transcriptomics technologies, such as reads that only partially match reference transcript models. Dedicated methods for quantification and statistical analysis therefore need to be developed. In this project, we aim to characterize in detail the data distribution of long-read data, propose strategies to deal with their particular read-assignment uncertainty, and develop new strategies for differential expression analysis. The overarching goal is to create an analytical framework that fully leverages LRS technologies for the study of isoform dynamics in relation to biomedically relevant questions.
The goal of this project is to develop the SQANTI-QDE software, the first integrated framework for the management of long-read sequencing Iso-Seq experiments. SQANTI-QDE will provide, in a single tool, functionalities for the annotation and processing of multiple samples, improved definition of bona fide transcripts, quantification of transcript expression, flexible creation of count matrices, data normalization, and differential expression and isoform usage analysis. Highly replicated, deeply sequenced long-read transcriptomics datasets will also be generated as part of this project.