Since the invention of microarrays, measuring genome-wide gene expression has been one of the most common experiments performed by molecular biologists. Gene expression analysis is also widely used in clinical applications to discover the molecular architecture of disease or to develop prognostic and predictive signatures. RNA-sequencing (RNA-seq) has become the preferred technology for making expression measurements due to declining costs and because RNA-seq is flexible enough to measure expression in regions not previously annotated as genes and to measure the abundances of multiple transcripts for individual genes. Now that RNA-seq data can be collected inexpensively and processed in experiments with replicates, a major challenge is statistical modeling and interpretation of results from RNA-seq experiments. Our proposal will tackle three key practical challenges in RNA-seq data analysis: (1) estimation and removal of hidden artifacts, (2) statistical models for differential expression scanning that do not rely on annotation or assembly, and (3) robust statistical models to correct ambiguous, variable, and unidentifiable assemblies, with specific application to the most popular computational RNA-seq software, Cufflinks.
The first aim extends our batch discovery and removal methods to RNA-sequencing data by modeling the within-gene and spatial dependence in expression estimates that leads to heavily biased artifact estimates and reduced power.
The second aim develops a statistical framework for first identifying regions of differential expression at base-pair resolution, then associating these regions with known genomic landmarks or annotation as a lightweight and accurate scanning approach. This approach builds on the most mature statistical methods for RNA-seq analysis but does not rely on annotation to define transcriptional units such as genes or exons, allowing for unbiased discovery of differential expression.
The third aim develops a statistical normalization and analysis framework that addresses the most egregious artifacts and limitations of the inherently ambiguous transcript assembly process. We will work closely with the developers of the most popular RNA-seq assembly software, Cufflinks, to integrate our developments into that software suite. By modeling variation across genes using functional regression and in the transcript assembly process using hierarchical models, we will reduce the number of false positives and increase the reproducibility of alternative transcript differential expression results. The statistical methods we develop will be packaged in freely available open-source software that is designed to interact with downstream Bioconductor packages for summarization and visualization such as IRanges or Genominator. The result of this proposal will be a modular, integrated pipeline for analyzing RNA-seq data from raw reads produced by the sequencing machine to easily summarized and visualized tables of robust, interpretable, and reproducible results, thereby increasing the number and range of applications of RNA-seq in molecular biology and medicine.

Public Health Relevance

Genome-wide gene expression measurements are widely used to understand the molecular basis for diseases and to develop predictive and prognostic biomarkers. RNA-sequencing is a new technology for making expression measurements that is more flexible but produces larger and more complex data. We propose to develop statistical methods and software for analyzing these data, accounting for biological and technological errors.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM105705-02
Application #
8722575
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Bender, Michael T
Project Start
2013-09-01
Project End
2018-04-30
Budget Start
2014-05-01
Budget End
2015-04-30
Support Year
2
Fiscal Year
2014
Total Cost
Indirect Cost
Name
Johns Hopkins University
Department
Biostatistics & Other Math Sci
Type
Schools of Public Health
DUNS #
City
Baltimore
State
MD
Country
United States
Zip Code
21218
Jaffe, Andrew E; Tao, Ran; Norris, Alexis L et al. (2017) qSVA framework for RNA quality correction in differential expression analysis. Proc Natl Acad Sci U S A 114:7130-7135
Collado-Torres, Leonardo; Nellore, Abhinav; Kammers, Kai et al. (2017) Reproducible RNA-seq analysis using recount2. Nat Biotechnol 35:319-321
Kammers, Kai; Taub, Margaret A; Ruczinski, Ingo et al. (2017) Integrity of Induced Pluripotent Stem Cell (iPSC) Derived Megakaryocytes as Assessed by Genetic and Transcriptomic Analysis. PLoS One 12:e0167794
Nellore, Abhinav; Jaffe, Andrew E; Fortin, Jean-Philippe et al. (2016) Human splicing diversity and the extent of unannotated splice junctions across human RNA-seq samples on the Sequence Read Archive. Genome Biol 17:266
Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11:1650-67
Nellore, Abhinav; Wilks, Christopher; Hansen, Kasper D et al. (2016) Rail-dbGaP: analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce. Bioinformatics 32:2551-3
Frazee, Alyssa C; Pertea, Geo; Jaffe, Andrew E et al. (2015) Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat Biotechnol 33:243-6
Patil, Prasad; Bachant-Winner, Pierre-Olivier; Haibe-Kains, Benjamin et al. (2015) Test set bias affects reproducibility of gene signatures. Bioinformatics 31:2318-23
Jaffe, Andrew E; Shin, Jooheon; Collado-Torres, Leonardo et al. (2015) Developmental regulation of human cortex transcription and its clinical relevance at single base resolution. Nat Neurosci 18:154-161
Jaffe, Andrew E; Hyde, Thomas; Kleinman, Joel et al. (2015) Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis. BMC Bioinformatics 16:372

Showing the most recent 10 out of 16 publications