While RNA-Seq experiments based on Second Generation Sequencing (SGS) short reads have enabled remarkable advances in our ability to analyze the transcriptome, a few fundamental problems remain unsolved due to the high complexity of the genome and the inability of short reads to identify combinatorial genomic events. Third Generation Sequencing (TGS), including PacBio sequencing and Oxford Nanopore Technologies (ONT) sequencing, provides much longer reads (1-100 kb) and has the potential to overcome these problems. However, the current strategy of using only PacBio data is costly and laborious, and is not practical for mid-size labs. Hybrid sequencing ("Hybrid-Seq"), which integrates TGS and SGS data, has emerged as an approach to address both the limitations of short SGS reads and the high error rate of TGS reads. However, tools to analyze Hybrid-Seq transcriptome data are not currently available because the majority of methodological developments have focused on Hybrid-Seq genomic data. To improve our understanding of transcriptome complexity, we will develop a comprehensive Hybrid-Seq platform of novel statistical and computational methods to analyze TGS long reads with the aid of SGS short reads, and to identify gene isoforms, fusion transcripts and allele-specific expression (ASE). The proposed studies build on our published and preliminary work, in which we developed methods for TGS error correction and for detection of novel gene isoforms, and applied them to Hybrid-Seq transcriptome data from human embryonic stem cells (hESCs).
In Aim 1, we will develop computational and statistical approaches to identify and quantify gene isoforms.
In Aim 2, we will develop computational methods to discover fusion transcripts.
In Aim 3, we will determine the haplotypes of gene alleles and quantify ASE using Hybrid-Seq data. The methods developed in this proposal will be integrated into a software platform for analysis of Hybrid-Seq transcriptome data. This user-friendly bioinformatics platform will have important positive impacts by providing an unprecedented opportunity for comprehensive transcriptome profiling, with broad applicability and higher resolution. In addition, these tools will enable more researchers to apply Hybrid-Seq to their transcriptome studies.
PUBLIC HEALTH RELEVANCE STATEMENT: The Hybrid-Seq strategy combines the strengths of Third Generation Sequencing and Second Generation Sequencing and overcomes the weaknesses of both techniques. Our data analysis platform will provide a set of robust, easy-to-use tools to fully analyze Hybrid-Seq transcriptome data, so that this cutting-edge technology becomes affordable and feasible for biomedical research laboratories of all sizes. The analysis of Hybrid-Seq data can identify the real products of functional genes, providing a solid foundation for human transcriptome research.