The next generation of RNA-Seq simulators for benchmarking analyses

Grant, Gregory

Abstract

RNA-Sequencing (RNA-Seq) has established itself as the primary method for studying transcription in basic research, with an emerging role in the clinic; currently upwards of 5,000 publications using the technology are indexed in PubMed. However, interpreting RNA-Seq data requires several complex operations, including alignment, quantification, normalization and statistical analyses of various types. Since the technology's inception, a large number of algorithms have appeared for each step, creating a very confusing landscape for investigators. To determine the best analysis practices, numerous benchmarking studies have emerged which leverage real RNA-Seq data made from well-studied RNA samples, such as the Genetic European Variation in Health and Disease (GEUVADIS) consortium data. These valuable RNA-Seq datasets contain the biases and errors introduced by sequencing biochemistry, factors that any analysis method must account for and overcome. However, the utility of such datasets for benchmarking analysis methods is limited by the fact that we do not know the underlying truth (e.g. the true number of RNA molecules from each transcript in the original sample). Researchers therefore tend to rely heavily on simulated data, for which the true composition of each sample is fully known. There are dozens of DNA simulators aimed at benchmarking applications such as variant calling. The need for simulators is just as strong in RNA analysis, yet only a few RNA-Seq simulators are available. Furthermore, the available RNA-Seq simulators are based on simplifying assumptions that greatly restrict their utility for benchmarking anything but the most upstream steps in the analysis pipeline (e.g. alignment). The further downstream the analysis method is, the more accurately the true nature of real data and its technical biases needs to be modeled in order to draw meaningful conclusions. For example, no simulator generates data from a diploid genome, which would be necessary to evaluate allele-specific quantification. Given our extensive experience with RNA-Seq analysis and transcriptomics in general, our success at building the BEERS simulator, and our track record of authorship on all comprehensive RNA-Seq aligner benchmarking studies published to date, we are ideally situated to develop a next-generation open-source RNA-Seq simulator that aims to model all sources of technical variability. Furthermore, the simulator will model biological variability with an empirical approach that uses real data to configure the simulator's parameters, which is a natural problem for machine learning. There are eleven steps in RNA-Seq library preparation which introduce bias, all of which will be modeled by the software in an object-oriented modular framework.
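To make the idea of an object-oriented modular framework concrete, the following is a minimal Python sketch, not the proposed simulator and not the BEERS API. The abstract does not enumerate the eleven library-preparation steps, so the three steps shown (fragmentation, size selection, GC-biased amplification) are illustrative assumptions, as are all class names, function names and parameters. The sketch shows two things the abstract relies on: each bias-introducing step is a separate, swappable object, and the true per-transcript, per-allele molecule counts are known before any step runs.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    transcript_id: str  # transcript the molecule originated from
    allele: str         # parental copy, so a diploid source can be tracked
    sequence: str


class Step:
    """Hypothetical interface: one library-preparation step."""

    def run(self, molecules, rng):
        raise NotImplementedError


class Fragmentation(Step):
    """Cut every molecule once at a uniformly random internal position."""

    def run(self, molecules, rng):
        pieces = []
        for m in molecules:
            if len(m.sequence) < 2:
                pieces.append(m)  # too short to cut
                continue
            cut = rng.randint(1, len(m.sequence) - 1)
            pieces.append(Molecule(m.transcript_id, m.allele, m.sequence[:cut]))
            pieces.append(Molecule(m.transcript_id, m.allele, m.sequence[cut:]))
        return pieces


class SizeSelection(Step):
    """Discard fragments shorter than min_length."""

    def __init__(self, min_length):
        self.min_length = min_length

    def run(self, molecules, rng):
        return [m for m in molecules if len(m.sequence) >= self.min_length]


class GCBiasedAmplification(Step):
    """One amplification cycle; copy probability drops as GC departs from 50%."""

    def __init__(self, efficiency=0.9):
        self.efficiency = efficiency

    def run(self, molecules, rng):
        amplified = []
        for m in molecules:
            amplified.append(m)
            gc = sum(base in "GC" for base in m.sequence) / max(len(m.sequence), 1)
            p_copy = self.efficiency * max(0.0, 1.0 - 2.0 * abs(gc - 0.5))
            if rng.random() < p_copy:
                amplified.append(m)
        return amplified


def true_counts(molecules):
    """Ground truth: molecules per (transcript, allele) pair."""
    return Counter((m.transcript_id, m.allele) for m in molecules)


def run_pipeline(molecules, steps, seed=0):
    """Apply each step in order with a seeded generator for reproducibility."""
    rng = random.Random(seed)
    for step in steps:
        molecules = step.run(molecules, rng)
    return molecules
```

In use, one would record `true_counts(sample)` on the starting molecules, pass the sample through `run_pipeline(sample, [Fragmentation(), SizeSelection(min_length=3), GCBiasedAmplification()])`, and then compare what a quantification method recovers from the output against the recorded truth. Because each step is its own class, a step can be replaced or reparameterized without touching the others.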

Public Health Relevance

Many algorithms have been developed for every step of the RNA-Seq analysis pipeline, with no easy way to compare them. Simulated data are useful for this purpose, but to date very few RNA-Seq simulators are available, and all make too many simplifying assumptions to be used for anything but the most upstream steps in the pipeline, e.g. alignment. We propose to develop a next-generation open-source RNA-Seq simulator, which will capture all of the biochemical processes in a modular fashion and model all of the sources of technical variation.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Exploratory/Developmental Grants (R21)
Project #: 5R21LM012763-02
Application #: 9730605
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane

Project Start: 2018-07-01
Project End: 2020-06-30
Budget Start: 2019-07-01
Budget End: 2020-06-30
Support Year: 2
Fiscal Year: 2019

Institution

Name: University of Pennsylvania
Department: Genetics
Type: Schools of Medicine
DUNS #: 042250712

City: Philadelphia
State: PA
Country: United States
Zip Code: 19104

Related projects


NIH 2019 R21 LM	The next generation of RNA-Seq simulators for benchmarking analyses Grant, Gregory R. / University of Pennsylvania
NIH 2018 R21 LM	The next generation of RNA-Seq simulators for benchmarking analyses Grant, Gregory R. / University of Pennsylvania
