The transcription and processing of eukaryotic genes usually generate more than one RNA product. The combined effects of alternative transcription initiation, RNA splicing, and 3′ end formation define the diversity of RNA species that are generated by a gene. Alternative RNA molecules can differ in their coding sequences, their regulatory sequences, or both, leading to production of different proteins, different amounts of the same protein, different subcellular localization, or changes to mRNA half-life. Changes to any of these properties can cause numerous human diseases, including many cancer types. It is therefore critical to understand how alternative transcript processing influences gene expression, and how disease-associated variants in genes involved in transcript processing change this relationship. Current approaches for measuring how RNA diversity affects gene expression are inadequate. Deep sequencing of billions of short reads (~100 nucleotides) is currently used to measure properties of RNA species in cells, such as which proteins are bound to them, how well they are translated, and where they are located. These events are typically marked by nuclease digestion or truncation of cDNA products during reverse transcription. However, many human RNA molecules are thousands of nucleotides long, so it is difficult to determine which alternative transcript a given short read originated from. This represents a fundamental barrier to understanding many aspects of how alternative RNAs affect gene expression, such as which individual transcripts make more or less protein, whether different splicing events are correlated, and how the 5′ and 3′ regulatory untranslated regions of an mRNA are coordinated. Recently, long-read sequencing approaches have been developed that can sequence complete mRNA molecules for most human genes. However, long-read sequencing is currently only used to identify RNA species and cannot measure their functional properties.
This proposal advances a series of innovative approaches to encode biological function in sequence space, which is then read out using long-read sequencing. Completion of the proposed research will answer basic questions, such as what is the distribution of ribosomes per individual mRNA molecule for native human genes, and also answer disease-relevant questions, such as how variants in RNA binding proteins found in human cancers change protein production. The approaches developed in this work will enable a fundamentally new way of measuring functional properties of RNA molecules in human cells and will broadly impact biomedical research.
Most eukaryotic genes generate diverse RNA molecules through many layers of transcript processing, such as alternative transcription initiation, splicing, and polyadenylation. Both the generation and functional output of RNA molecules can change across cell types, in response to environmental perturbations, or in disease states, reshaping the complement of proteins found in cells. This proposal leverages new approaches to directly identify how diverse RNA molecules interface with cellular RNA binding proteins, including the ribosome, revealing new biology and empowering discovery throughout biomedical science.