Next-generation Illumina RNA sequencing (RNA-seq) is by far the most widely used assay for investigating animal transcriptomes, and numerous public RNA-seq data sets have been generated for various biological conditions in multiple species. However, several barriers remain to using short RNA-seq reads to accurately identify the splicing structures and quantify the abundances of full-length RNA transcripts. In this proposal, we will develop a series of novel statistical and computational methods to improve the robustness of transcript identification and the accuracy of transcript quantification from Illumina RNA-seq data.
(Aim 1) We will develop a novel screening method to construct transcript candidates by first detecting sparse splicing structures from multiple RNA-seq data sets for a given biological condition. These transcript candidates will significantly reduce the search space of downstream transcript identification methods and hence improve their precision.
(Aim 2) We will develop a robust transcript identification method to identify novel transcripts in a conservative manner from RNA-seq data given existing annotations. Our method will be based on statistical model selection under the Neyman-Pearson paradigm, which will allow users to keep the false positive rate among the identified novel transcripts below any specified threshold with high probability.
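As an illustration only of what a high-probability guarantee of this kind can look like, the sketch below uses the order-statistic rule from Neyman-Pearson classification. This is an assumption made for the example and is not necessarily the model selection procedure to be developed under Aim 2. Given n scores from a hypothetical null sample (candidates known not to be real transcripts), it finds the smallest rank k such that calling a transcript only when its score exceeds the k-th smallest null score lets the false positive rate exceed alpha with probability at most delta. The names violation_bound, np_threshold_rank, alpha, and delta are hypothetical.

```python
from math import comb


def violation_bound(n, k, alpha):
    """Upper bound on P(false positive rate > alpha) when the threshold is
    the k-th smallest (1-indexed) of n i.i.d. null scores.

    The false positive rate exceeds alpha exactly when fewer than a
    (1 - alpha) fraction of the null distribution lies below the threshold,
    which is a binomial tail event in the number of null scores.
    """
    return sum(
        comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
        for j in range(k, n + 1)
    )


def np_threshold_rank(n, alpha, delta):
    """Smallest rank k whose violation bound is at most delta.

    Returns None when even the largest null score (k = n) cannot meet the
    requirement, i.e. when the null sample is too small.
    """
    for k in range(1, n + 1):
        if violation_bound(n, k, alpha) <= delta:
            return k
    return None


if __name__ == "__main__":
    # Hypothetical settings: 200 null scores, 5% target rate, 95% confidence.
    print(np_threshold_rank(200, alpha=0.05, delta=0.05))
```

Under this rule the null sample must be large enough that (1 - alpha)^n falls below delta; with alpha = 0.05 and delta = 0.05, for example, no valid rank exists until n reaches 59.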
(Aim 3) We will develop an accurate transcript quantification method that effectively leverages multiple RNA-seq data sets and simultaneously assesses their quality based on low-throughput gold standards and cross-data similarities.
All of these methods will first be used to study transcripts in mouse macrophages, for which gold-standard qPCR and full-length cDNA sequences will be generated for training and method validation. The methods will then be tested more broadly in other biological systems where suitable gold-standard data are available. Our methods and software will significantly facilitate the use of Illumina RNA-seq data for gene expression studies at the transcript level, increase the reproducibility of scientific discoveries from transcriptomic studies, and improve our understanding of gene expression mechanisms in various biological conditions.
This project will create a set of computational methods to improve the robustness and accuracy of detecting and quantifying RNA molecules from next-generation RNA sequencing data. These methods will serve as tools for investigating gene expression changes across biological conditions at the finer scale of individual transcripts. We will distribute the methods as open-source software packages to benefit the scientific and biomedical communities.