Alternative splicing (AS) is a gene regulatory mechanism with important roles in human biology and disease. High-throughput sequencing of RNA (RNA-seq) is making it possible to survey the expressed genes and their alternative splicing variations in a wide variety of cellular conditions. However, the short reads are challenging to analyze, demanding highly sophisticated computational methods that can extract meaningful AS information efficiently, accurately, and comprehensively. While there has been great progress so far, current methods based on assembling the short reads into transcript annotations have reached a plateau. We propose two innovations that can help overcome these limits. The first is one-step simultaneous analysis of multiple samples in an RNA-seq collection, in contrast with the current two-step approach that analyzes each sample separately and then merges the results. The second is to create and interrogate assembly-free representations of AS. The project will design a suite of tools that will leverage the latent information in large collections of samples and in heterogeneous data types to build complete and accurate AS signatures of tissues and cell types, and to elucidate the regulatory circuitry of AS and its functional implications.
Aim 1 will develop a high-performance multi-sample transcript assembly tool, combining subexon graph representations of genes and AS variations, statistical methods for improved feature detection, and search space reduction techniques for efficient sample processing.
Aim 2 will build highly efficient and accurate feature selection tools to detect and characterize assembly-free AS variations (subexons and introns) simultaneously from collections of RNA-seq samples. It will combine novel regularized programs with complex models of intronic "noise" and other RNA-seq confounders, and will enable analyses of differential splicing and the identification of individual- and group-specific variations. Lastly, Aim 3 will develop a system to comprehensively model the regulatory and functional circuitry of AS and the effects of mutations, starting from deep learning models of sequences and alignments and integrating expression, sequence, epigenetic and mutation data across tissues, cell types and conditions. We will rigorously test and evaluate all tools in simulations and on large public data sets, as well as on thyroid and head and neck cancer data provided by our collaborators, and we will experimentally validate random subsets of predictions with capillary electrophoresis and qRT-PCR. Collectively, the concepts, methods and tools will establish a new framework for analyzing RNA-seq data that can efficiently tackle the "big data" challenges, leading to more complete discovery and annotation of AS structure and function in human health and disease.
Alternative splicing is a fundamental gene regulatory mechanism with important roles in human physiology and disease. Next-generation RNA sequencing (RNA-seq) is making it possible to characterize alternative splicing in great detail; however, current bioinformatics analysis tools miss important variations. The project will design a suite of innovative tools that integrate information across multiple RNA-seq samples, and across heterogeneous sequence, epigenetic and expression data, to determine the structure, function and regulation of alternative splicing in human biology and disease more comprehensively and more accurately.