We propose to develop improved, modular pipelines for more accurate and reproducible RNA-seq analyses. RNA-seq experiments are widely used in biological and biomedical sciences to determine the expression level of all genes and isoforms across multiple samples. Raw RNA-seq data must be pre-processed to determine abundances of RNA molecules. State-of-the-art tools for quantifying RNA abundances are fast and efficient, model and correct for common technical biases, and provide estimates of the uncertainty of the abundances. Downstream tools for visualization and statistical testing of abundance ideally should incorporate uncertainty of abundance estimates from the quantification step, take into account the sampling variability inherent in observations in all sequencing experiments, and estimate, for each transcript, the underlying biological variation in abundances across samples. While isolated tools fulfill a subset of the above characteristics, we propose to develop a pipeline which addresses all of these, while at the same time leveraging the powerful existing infrastructure for gene expression analysis. Our modular approach to improving the current RNA-seq analysis pipelines will also seek to make use of the best downstream tools for gene set analysis and dynamic report generation. Current RNA-seq computational pipelines do not keep track of critical pieces of metadata throughout the analysis, including genome and transcriptome version, such that final results cannot reliably be reproduced or put in the correct genomic context as the information about annotation provenance may be lost. While fast and lightweight tools have been quickly adopted for gene- and transcript-level quantification, they are not yet optimized for certain RNA-seq analysis tasks such as quantification of allele-specific expression. We have developed a set of top-performing tools for abundance quantification and downstream inference.
We propose to formalize our existing tools into a pipeline, and build additional tools and infrastructure, which optimally estimates and propagates uncertainty from abundance estimation (described in Aim 1), and which stores critical provenance metadata automatically on the user's behalf; this metadata tagging and propagation will be integrated with community resources (described in Aim 2). Furthermore, we propose building out the capabilities of our existing quantification infrastructure to allow for improved mapping accuracy and more robust and accurate allelic expression estimation (described in Aim 3).

Public Health Relevance

RNA sequencing is a critical assay in genomics and biomedical research, being leveraged in a variety of contexts to study human health and disease. Deriving relevant biological information from RNA sequencing experiments relies on efficient, sophisticated algorithms to quantify from raw data the relative abundances of RNA transcripts. The new tools proposed here will facilitate more powerful, accurate, and comprehensive analyses, through the development of new capabilities and the propagation of uncertainty and metadata from quantification algorithms to downstream visualization and inference tools, leading to cleaner and more reproducible results.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG009937-01A1
Application #
9662069
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Gilchrist, Daniel A
Project Start
2018-09-18
Project End
2023-06-30
Budget Start
2018-09-18
Budget End
2019-06-30
Support Year
1
Fiscal Year
2018
Name
State University New York Stony Brook
Department
Biostatistics & Other Math Sci
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
804878247
City
Stony Brook
State
NY
Country
United States
Zip Code
11794