A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data

Patro, Robert; Love, Michael

Abstract

We propose to develop improved, modular pipelines for more accurate and reproducible RNA-seq analyses. RNA-seq experiments are widely used in biological and biomedical sciences to determine the expression level of all genes and isoforms across multiple samples. Raw RNA-seq data must be pre-processed to determine abundances of RNA molecules. State-of-the-art tools for quantifying RNA abundances are fast and efficient, model and correct for common technical biases, and provide estimates of the uncertainty of the abundances. Downstream tools for visualization and statistical testing of abundance ideally should incorporate uncertainty of abundance estimates from the quantification step, take into account the sampling variability inherent in observations in all sequencing experiments, and estimate, for each transcript, the underlying biological variation in abundances across samples. While isolated tools fulfill a subset of the above characteristics, we propose to develop a pipeline which addresses all of these, while at the same time leveraging the powerful existing infrastructure for gene expression analysis. Our modular approach to improving the current RNA-seq analysis pipelines will also seek to make use of the best downstream tools for gene set analysis and dynamic report generation. Current RNA-seq computational pipelines do not keep track of critical pieces of metadata throughout the analysis, including genome and transcriptome version, such that final results cannot reliably be reproduced or put in the correct genomic context as the information about annotation provenance may be lost. While fast and lightweight tools have been quickly adopted for gene- and transcript-level quantification, they are not yet optimized for certain RNA-seq analysis tasks such as quantification of allele-specific expression. We have developed a set of top-performing tools for abundance quantification and downstream inference.
We propose to formalize our existing tools into a pipeline, and build additional tools and infrastructure, which optimally estimates and propagates uncertainty from abundance estimation (described in Aim 1), and which stores critical provenance metadata automatically on the user's behalf; this metadata tagging and propagation will be integrated with community resources (described in Aim 2). Furthermore, we propose building out the capabilities of our existing quantification infrastructure to allow for improved mapping accuracy and more robust and accurate allelic expression estimation (described in Aim 3).

Public Health Relevance

RNA sequencing is a critical assay in genomics and biomedical research, being leveraged in a variety of contexts to study human health and disease. Deriving relevant biological information from RNA sequencing experiments relies on efficient, sophisticated algorithms to quantify from raw data the relative abundances of RNA transcripts. The new tools proposed here will facilitate more powerful, accurate, and comprehensive analyses, through the development of new capabilities and the propagation of uncertainty and metadata from quantification algorithms to downstream visualization and inference tools, leading to cleaner and more reproducible results.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG009937-04
Application #: 9954129
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Sen, Shurjo Kumar

Project Start: 2020-03-12
Project End: 2023-06-30
Budget Start: 2020-07-01
Budget End: 2021-06-30
Support Year: 4
Fiscal Year: 2020

Institution

Name: University of Maryland College Park
Type: Organized Research Units
DUNS #: 790934285

City: College Park
State: MD
Country: United States
Zip Code: 20742

Related projects


NIH 2020 R01 HG	A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data	Patro, Robert; Love, Michael Isaiah / University of Maryland College Park
NIH 2019 R01 HG	A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data	Patro, Robert; Love, Michael Isaiah / State University New York Stony Brook
NIH 2019 R01 HG	A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data	Patro, Robert; Love, Michael Isaiah / University of Maryland College Park
NIH 2018 R01 HG	A Modular Framework for Accurate, Efficient, and Reproducible Analysis of RNA-Seq Data	Patro, Robert; Love, Michael Isaiah / State University New York Stony Brook
