We propose to develop improved, modular pipelines for more accurate and reproducible RNA-seq analyses. RNA-seq experiments are widely used in biological and biomedical sciences to determine the expression level of all genes and isoforms across multiple samples. Raw RNA-seq data must be pre-processed to determine abundances of RNA molecules. State-of-the-art tools for quantifying RNA abundances are fast and efficient, model and correct for common technical biases, and provide estimates of the uncertainty of the abundances. Downstream tools for visualization and statistical testing of abundance ideally should incorporate uncertainty of abundance estimates from the quantification step, take into account the sampling variability inherent in observations in all sequencing experiments, and estimate, for each transcript, the underlying biological variation in abundances across samples. While isolated tools fulfill a subset of the above characteristics, we propose to develop a pipeline which addresses all of these, while at the same time leveraging the powerful existing infrastructure for gene expression analysis. Our modular approach to improving the current RNA-seq analysis pipelines will also seek to make use of the best downstream tools for gene set analysis and dynamic report generation. Current RNA-seq computational pipelines do not keep track of critical pieces of metadata throughout the analysis, including genome and transcriptome version, such that final results cannot reliably be reproduced or put in the correct genomic context as the information about annotation provenance may be lost. While fast and lightweight tools have been quickly adopted for gene- and transcript-level quantification, they are not yet optimized for certain RNA-seq analysis tasks such as quantification of allele-specific expression. We have developed a set of top-performing tools for abundance quantification and downstream inference.
We propose to formalize our existing tools into a pipeline, and build additional tools and infrastructure, which optimally estimates and propagates uncertainty from abundance estimation (described in Aim 1), and which stores critical provenance metadata automatically on the user's behalf; this metadata tagging and propagation will be integrated with community resources (described in Aim 2). Furthermore, we propose building out the capabilities of our existing quantification infrastructure to allow for improved mapping accuracy and more robust and accurate allelic expression estimation (described in Aim 3).

Public Health Relevance

RNA sequencing is a critical assay in genomics and biomedical research, being leveraged in a variety of contexts to study human health and disease. Deriving relevant biological information from RNA sequencing experiments relies on efficient, sophisticated algorithms to quantify from raw data the relative abundances of RNA transcripts. The new tools proposed here will facilitate more powerful, accurate, and comprehensive analyses, through the development of new capabilities and the propagation of uncertainty and metadata from quantification algorithms to downstream visualization and inference tools, leading to cleaner and more reproducible results.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG009937-04
Application #
9954129
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Sen, Shurjo Kumar
Project Start
2020-03-12
Project End
2023-06-30
Budget Start
2020-07-01
Budget End
2021-06-30
Support Year
4
Fiscal Year
2020
Total Cost
Indirect Cost
Name
University of Maryland College Park
Department
Type
Organized Research Units
DUNS #
790934285
City
College Park
State
MD
Country
United States
Zip Code
20742