Next-generation Illumina RNA sequencing (RNA-seq) is by far the most widely used assay for investigating animal transcriptomes, and numerous public RNA-seq data sets have been generated for various biological conditions in multiple species. However, several barriers remain to using short RNA-seq reads to accurately identify the splicing structures and quantify the abundances of full-length RNA transcripts. In this proposal, we will develop a series of novel statistical and computational methods to improve the robustness of transcript identification and the accuracy of transcript quantification from Illumina RNA-seq data.
(Aim 1) We will develop a novel screening method to construct transcript candidates by first detecting sparse splicing structures from multiple RNA-seq data sets for a given biological condition. These transcript candidates will significantly reduce the search space of downstream transcript identification methods and hence improve their precision.
(Aim 2) We will develop a robust transcript identification method to identify novel transcripts in a conservative manner from RNA-seq data given existing annotations. Our method will be based on statistical model selection under the Neyman-Pearson paradigm, which will allow users to control the false positive rate of our identified novel transcripts under any given threshold with high probability.
(Aim 3) We will develop an accurate transcript quantification method to effectively leverage multiple RNA-seq data sets and to simultaneously assess the data quality based on low-throughput gold standards and cross-data similarities.
All of these methods will first be used to study transcripts in mouse macrophages, for which gold-standard qPCR and full-length cDNA sequences will be generated for training and method validation. The methods will then be tested more broadly in other biological systems where suitable gold-standard data are available. Our methods and software will significantly facilitate the use of Illumina RNA-seq data for gene expression studies at the transcript level, increase the reproducibility of scientific discoveries from transcriptomic studies, and improve our understanding of gene expression mechanisms in various biological conditions.
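To make the Aim 2 guarantee concrete, the following is a minimal sketch of one generic way to control a false positive rate below a threshold with high probability. It is not the method proposed in this project, which the abstract does not specify. It assumes each transcript candidate receives a numeric score, that scores are available for a set of known non-novel (null) candidates, and that a candidate is called novel when its score exceeds a threshold chosen as an order statistic of the null scores. The names `np_threshold_rank`, `np_threshold`, `alpha`, and `delta` are hypothetical.

```python
from math import comb


def np_threshold_rank(n, alpha, delta):
    """Return the smallest 1-indexed rank k such that thresholding at the
    k-th smallest of n null scores keeps the false positive rate at or
    below alpha with probability at least 1 - delta.

    Returns None when n is too small for any rank to qualify.
    Assumption: null scores are i.i.d. draws from a continuous distribution.
    """
    for k in range(1, n + 1):
        # Probability that the false positive rate exceeds alpha when the
        # threshold is the k-th order statistic: P(Binomial(n, 1 - alpha) >= k).
        violation = sum(
            comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if violation <= delta:
            return k
    return None


def np_threshold(null_scores, alpha, delta):
    """Pick a score threshold from the null scores (hypothetical helper)."""
    ordered = sorted(null_scores)
    k = np_threshold_rank(len(ordered), alpha, delta)
    if k is None:
        raise ValueError("too few null scores for the requested alpha and delta")
    return ordered[k - 1]


if __name__ == "__main__":
    # Illustrative scores only, not real data.
    null_scores = [i / 100 for i in range(100)]
    t = np_threshold(null_scores, alpha=0.05, delta=0.05)
    print("call a candidate novel when its score exceeds", t)
```

Under these assumptions, a candidate is reported as novel only when its score exceeds the returned threshold. The rank is chosen conservatively, so with few null scores no valid threshold exists; for `alpha = delta = 0.05`, at least 59 null scores are needed.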

Public Health Relevance

This project will create a set of computational methods to improve the robustness and accuracy of detecting and quantifying RNA molecules from next-generation RNA sequencing data. These methods will serve as tools for investigating gene expression changes across biological conditions at the finer scale of individual transcripts. We will distribute the methods as open-source software packages to benefit the scientific and biomedical communities.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
1R01GM120507-01
Application #
9161008
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
Project Start
2016-09-01
Project End
2021-05-31
Budget Start
2016-09-01
Budget End
2017-05-31
Support Year
1
Fiscal Year
2016
Total Cost
Indirect Cost
Name
University of California Los Angeles
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
092530369
City
Los Angeles
State
CA
Country
United States
Zip Code
90095
Li, Wei Vivian; Chen, Yiling; Li, Jingyi Jessica (2017) TROM: A Testing-Based Method for Finding Transcriptomic Similarity of Biological Samples. Stat Biosci 9:105-136
Gao, Ruiqi; Li, Jingyi Jessica (2017) Correspondence of D. melanogaster and C. elegans developmental stages revealed by alternative splicing characteristics of conserved exons. BMC Genomics 18:234
Yang, Yang; Yang, Yu-Cheng T; Yuan, Jiapei et al. (2016) Large-scale mapping of mammalian transcriptomes identifies conserved genes associated with different cell states. Nucleic Acids Res :
Gao, Qinghui; Ho, Christine; Jia, Yingmin et al. (2012) Biclustering of linear patterns in gene expression data. J Comput Biol 19:619-31