Statistical models for biological and technical variation in RNA sequencing

Leek, Jeffrey

Abstract

Since the invention of microarrays, measuring genome-wide gene expression has become one of the most common experiments performed by molecular biologists. Gene expression analysis is also widely used in clinical applications to discover the molecular architecture of disease or to develop prognostic and predictive signatures. RNA-sequencing (RNA-seq) has become the preferred technology for making expression measurements because of declining costs and because it is flexible enough to measure expression in regions not previously annotated as genes and to measure the abundances of multiple transcripts for individual genes. Now that RNA-seq data can be collected inexpensively and processed in experiments with replicates, a major challenge is statistical modeling and interpretation of results from RNA-seq experiments. Our proposal will tackle three key practical challenges in RNA-seq data analysis: (1) estimation and removal of hidden artifacts, (2) statistical models for differential expression scanning that do not rely on annotation or assembly, and (3) robust statistical models to correct ambiguous, variable, and unidentifiable assemblies, with specific application to the most popular computational RNA-seq software, Cufflinks.
The first aim extends our batch discovery and removal methods to RNA-sequencing data by modeling within-gene and spatial dependence in expression estimates that lead to heavily biased artifact estimates and reduced power.
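
As a rough illustration of the kind of batch discovery this aim builds on, the sketch below removes a known group effect from each gene and then takes the leading per-sample factor of the residuals as a candidate hidden artifact. This is a generic surrogate-variable-style estimate under simplifying assumptions, not the method proposed in the grant: it ignores within-gene and spatial dependence, and the function names and toy data are hypothetical.

```python
from math import sqrt

def residualize(row, x):
    """Remove the intercept and the linear effect of a known covariate x from one gene."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(row) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, row)) / sxx
    return [yi - mean_y - slope * (xi - mean_x) for xi, yi in zip(x, row)]

def leading_sample_factor(resid, iters=50):
    """Power iteration for the leading per-sample factor of a genes-by-samples residual matrix."""
    # Start from the residual row with the largest norm so the start vector is nonzero.
    v = max(resid, key=lambda r: sum(e * e for e in r))
    for _ in range(iters):
        u = [sum(rij * vj for rij, vj in zip(row, v)) for row in resid]  # one score per gene
        v = [sum(resid[i][j] * u[i] for i in range(len(resid))) for j in range(len(v))]
        norm = sqrt(sum(e * e for e in v))
        if norm == 0:
            raise ValueError("residual matrix has no remaining variation")
        v = [e / norm for e in v]
    return v

def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

# Toy data (hypothetical): 4 genes, 6 samples, a known group and an unrecorded batch.
group = [0, 0, 0, 1, 1, 1]
hidden_batch = [1, -1, 1, -1, 1, -1]
effects = [(2.0, 1.0), (0.0, -1.5), (1.0, 0.5), (0.0, 2.0)]  # (group effect, batch effect)
expression = [[10 + g * x + b * h for x, h in zip(group, hidden_batch)] for g, b in effects]

resid = [residualize(row, group) for row in expression]
factor = leading_sample_factor(resid)
corr = correlation(factor, hidden_batch)
```

In this noise-free toy example the estimated factor tracks the unrecorded batch closely even though the batch was never supplied to the model; in practice such a factor would be included as a covariate in the downstream differential expression model.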
The second aim develops a statistical framework for first identifying regions of differential expression at base-pair resolution, then associating these regions with known genomic landmarks or annotation as a lightweight and accurate scanning approach. This approach builds on the most mature statistical methods for RNA-seq analysis but does not rely on annotation to define transcriptional units such as genes or exons, allowing for unbiased discovery of differential expression.
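
The two-step idea described for this aim (scan every base first, annotate afterwards) can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed framework: it uses a plain Welch t-statistic per base and a fixed cutoff, and the function names, the cutoff value, the feature names, and the coverage data are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def base_level_t(group_a, group_b):
    """Welch t-statistic at every base; each group is a list of per-sample coverage vectors."""
    stats = []
    for pos in range(len(group_a[0])):
        a = [sample[pos] for sample in group_a]
        b = [sample[pos] for sample in group_b]
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        stats.append((mean(a) - mean(b)) / se if se > 0 else 0.0)
    return stats

def find_regions(stats, cutoff):
    """Merge contiguous bases with |t| > cutoff into half-open (start, end) regions."""
    regions, start = [], None
    for pos, t in enumerate(stats):
        if abs(t) > cutoff:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(stats)))
    return regions

def annotate(regions, features):
    """Pair each region with the names of annotated features (name, start, end) it overlaps."""
    return [
        (region, [name for name, fs, fe in features if fs < region[1] and region[0] < fe])
        for region in regions
    ]

# Toy per-base coverage (hypothetical): 3 samples per group over 10 bases.
group_a = [
    [5, 5, 5, 20, 21, 19, 22, 5, 5, 5],
    [6, 4, 5, 22, 20, 21, 20, 6, 4, 5],
    [5, 6, 4, 21, 19, 20, 21, 4, 6, 5],
]
group_b = [
    [5, 6, 5, 5, 6, 4, 5, 5, 6, 5],
    [6, 5, 4, 6, 5, 5, 6, 6, 5, 4],
    [4, 5, 6, 5, 4, 6, 5, 4, 5, 6],
]
features = [("geneX_exon1", 0, 5), ("geneY", 8, 10)]  # hypothetical annotation

regions = find_regions(base_level_t(group_a, group_b), cutoff=4.0)
annotated = annotate(regions, features)
```

Here the differentially covered stretch is found without consulting the annotation, and only afterwards is it matched to the overlapping feature; the part of the region that extends beyond the annotated exon is still reported.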
The third aim develops a statistical normalization and analysis framework that addresses the most egregious artifacts and limitations of the inherently ambiguous transcript assembly process. We will work closely with the developers of the most popular RNA-seq assembly software, Cufflinks, to integrate our developments into that software suite. By modeling variation across genes using functional regression and in the transcript assembly process using hierarchical models, we will reduce the number of false positives and increase the reproducibility of alternative transcript differential expression results. The statistical methods we develop will be packaged in freely available open source software that is designed to interact with downstream Bioconductor packages for summarization and visualization such as IRanges or Genominator. The result of this proposal will be a modular, integrated pipeline for analyzing RNA-seq data, from raw reads produced by the sequencing machine to easily summarized and visualized tables of robust, interpretable, and reproducible results, thereby increasing the number and range of applications of RNA-seq in molecular biology and medicine.

Public Health Relevance

Genome-wide gene expression measurements are widely used to understand the molecular basis for diseases and to develop predictive and prognostic biomarkers. RNA-sequencing is a new technology for making expression measurements that is more flexible but produces larger and more complex data. We propose to develop statistical methods and software for analyzing these data, accounting for biological and technological errors.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 1R01GM105705-01A1
Application #: 8593469
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Bender, Michael T

Project Start: 2013-09-01
Project End: 2018-04-30
Budget Start: 2013-09-01
Budget End: 2014-04-30
Support Year: 1
Fiscal Year: 2013
Total Cost: $307,800
Indirect Cost: $117,800

Institution

Name: Johns Hopkins University
Department: Biostatistics & Other Math Sci
Type: Schools of Public Health
DUNS #: 001910777

City: Baltimore
State: MD
Country: United States
Zip Code: 21218

Related projects


NIH 2017 R01 GM	Statistical models for biological and technical variation in RNA sequencing Leek, Jeffrey T. / Johns Hopkins University
NIH 2016 R01 GM	Statistical models for biological and technical variation in RNA sequencing Leek, Jeffrey T. / Johns Hopkins University
NIH 2015 R01 GM	Statistical models for biological and technical variation in RNA sequencing Leek, Jeffrey T. / Johns Hopkins University	$307,800
NIH 2014 R01 GM	Statistical models for biological and technical variation in RNA sequencing Leek, Jeffrey T. / Johns Hopkins University
NIH 2013 R01 GM	Statistical models for biological and technical variation in RNA sequencing Leek, Jeffrey T. / Johns Hopkins University	$307,800

Publications

Jaffe, Andrew E; Tao, Ran; Norris, Alexis L et al. (2017) qSVA framework for RNA quality correction in differential expression analysis. Proc Natl Acad Sci U S A 114:7130-7135

Collado-Torres, Leonardo; Nellore, Abhinav; Kammers, Kai et al. (2017) Reproducible RNA-seq analysis using recount2. Nat Biotechnol 35:319-321

Kammers, Kai; Taub, Margaret A; Ruczinski, Ingo et al. (2017) Integrity of Induced Pluripotent Stem Cell (iPSC) Derived Megakaryocytes as Assessed by Genetic and Transcriptomic Analysis. PLoS One 12:e0167794

Nellore, Abhinav; Jaffe, Andrew E; Fortin, Jean-Philippe et al. (2016) Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol 17:266

Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11:1650-67

Nellore, Abhinav; Wilks, Christopher; Hansen, Kasper D et al. (2016) Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics 32:2551-3

Jaffe, Andrew E; Shin, Jooheon; Collado-Torres, Leonardo et al. (2015) Developmental regulation of human cortex transcription and its clinical relevance at single base resolution. Nat Neurosci 18:154-161

Jaffe, Andrew E; Hyde, Thomas; Kleinman, Joel et al. (2015) Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis. BMC Bioinformatics 16:372

Pertea, Mihaela; Pertea, Geo M; Antonescu, Corina M et al. (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33:290-5

Frazee, Alyssa C; Jaffe, Andrew E; Langmead, Ben et al. (2015) Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics 31:2778-84

Showing the most recent 10 out of 16 publications