Transcriptome analysis by short-read RNA-seq [1] is a core component of research in nearly all fields of biology. However, even specialized RNA-seq protocols cannot comprehensively identify and quantify full-length RNA transcript isoforms. Yet isoform-level resolution is essential for transcriptome annotation of model and non-model organisms. While long-read sequencing technology has the potential to overcome this challenge, the current leading approaches, Pacific Biosciences (PacBio) [2] and Oxford Nanopore Technologies (ONT) [3] long-read technologies, have drawbacks of their own that limit their wide adoption for transcriptome annotation. For example, PacBio IsoSeq cannot generate the tens of millions of reads required to determine comprehensive isoform-level transcriptomes, while ONT cDNA and direct RNA sequencing methods cannot do so at the required accuracy [4]. To overcome these limitations, we propose to develop new short- and long-read sequencing and computational approaches. We are building on our experience with short-read technology [5] to develop a new short-read cDNA sequencing method that can identify transcription start sites, polyA sites, and splice sites with very high accuracy in a single easy-to-implement experiment. Currently, this requires three separate, complex experiments. Further, we are significantly improving our R2C2 method, which is already among the most capable long-read full-length cDNA sequencing methods currently available [6] (R2C2 improves on standard ONT approaches by increasing accuracy from 87% to 94% while producing more full-length cDNA sequences). To improve R2C2 further, we will increase its read accuracy to 98%, increase its effective throughput at least twofold, and make it possible to capture very long cDNA molecules.
To identify and quantify isoforms more accurately from the resulting short- and long-read data, we will modify our Mandalorion [3,6] isoform identification software to take full advantage of both data types. Together, these advances will form an integrated workflow for transcriptome annotation of unprecedented power. We will then apply this integrated workflow to improve transcriptome annotations of Homo sapiens and commonly used model organisms. Generating high-quality isoform-level transcriptome data for Homo sapiens, Rattus norvegicus, Mus musculus, Drosophila melanogaster, and Caenorhabditis elegans will create valuable resources for biomedical research and enable us to investigate how isoform diversity has evolved and is regulated.

Public Health Relevance

Despite the wave of new insights it has produced, transcriptome analysis by short-read RNA-seq has been limited by its inability to resolve complex gene and isoform expression. Full-length cDNA sequencing is rapidly improving and has the power to delineate full-length isoforms but is limited in throughput. We aim to leverage short- and long-read approaches to analyze transcriptomes, with the goals of (1) developing a powerful integrated workflow for transcriptome annotation and (2) applying this workflow to improve transcriptome annotations of human, rat, mouse, Drosophila, and C. elegans.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Unknown (R35)
Project #
1R35GM133569-01
Application #
9798492
Study Section
Special Emphasis Panel (ZGM1)
Program Officer
Ravichandran, Veerasamy
Project Start
2019-08-01
Project End
2024-07-31
Budget Start
2019-08-01
Budget End
2020-07-31
Support Year
1
Fiscal Year
2019
Total Cost
Indirect Cost
Name
University of California Santa Cruz
Department
Engineering (All Types)
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
125084723
City
Santa Cruz
State
CA
Country
United States
Zip Code
95064