RNA-Sequencing (RNA-Seq) has established itself as the primary method for studying transcription in basic research, with an emerging role in the clinic; upwards of 5,000 publications using the technology are currently indexed in PubMed. However, the interpretation of RNA-Seq data requires several complex operations, including alignment, quantification, normalization and statistical analyses of various types. Since the technology's inception, a large number of algorithms have appeared for each step, creating a very confusing landscape for investigators. To determine the best analysis practices, numerous benchmarking studies have emerged that leverage real RNA-Seq data generated from well-studied RNA samples, such as the Genetic European Variation in Health and Disease (GEUVADIS) consortium data. These valuable RNA-Seq datasets contain the biases and errors introduced by sequencing biochemistry, factors that any analysis method must account for and overcome. However, the utility of such datasets for benchmarking analysis methods is limited by the fact that we do not know the underlying truth (e.g. the true number of RNA molecules from each transcript in the original sample). Researchers therefore tend to rely heavily on simulated data, for which the true composition of each sample is fully known. There are dozens of DNA simulators aimed at benchmarking applications such as variant calling. Although the need for simulators is just as strong in RNA analysis, only a few RNA-Seq simulators are available. Furthermore, the available RNA-Seq simulators are based on simplifying assumptions that greatly restrict their utility for benchmarking anything but the most upstream steps in the analysis pipeline (e.g. alignment). The further downstream the analysis method is, the more accurately the true nature of real data and its technical biases must be modeled in order to draw meaningful conclusions.
For example, no simulator generates data from a diploid genome, which would be necessary to evaluate allele-specific quantification. Given our extensive experience with RNA-Seq analysis and transcriptomics in general, our success at building the BEERS simulator, and our track record of authorship on all comprehensive RNA-Seq aligner benchmarking studies published to date, we are ideally situated to develop a next-generation open-source RNA-Seq simulator that aims to model all sources of technical variability. Furthermore, the simulator will model biological variability with an empirical approach that uses real data to configure the simulator's parameters, a problem naturally suited to machine learning. Eleven steps in RNA-Seq library preparation introduce bias, and the software will model all of them in an object-oriented, modular framework.
Many algorithms have been developed for every step of the RNA-Seq analysis pipeline, with no easy way to compare them. Simulated data are useful for this purpose, but to date very few RNA-Seq simulators are available, and all make too many simplifying assumptions to be used for anything but the most upstream steps in the pipeline, e.g. alignment. We propose to develop a next-generation open-source RNA-Seq simulator that will capture all of the biochemical processes in a modular fashion and model all of the sources of technical variation.