Since the invention of microarrays, measuring genome-wide gene expression has been one of the most common experiments performed by molecular biologists. Gene expression analysis is also widely used in clinical applications to discover the molecular architecture of disease or to develop prognostic and predictive signatures. RNA-sequencing (RNA-seq) has become the preferred technology for making expression measurements because of declining costs and because it is flexible enough to measure expression in regions not previously annotated as genes and to measure the abundances of multiple transcripts for individual genes. Now that RNA-seq data can be collected inexpensively and processed in experiments with replicates, a major challenge is statistical modeling and interpretation of results from RNA-seq experiments. Our proposal will tackle three key practical challenges in RNA-seq data analysis: (1) estimation and removal of hidden artifacts, (2) statistical models for differential expression scanning that do not rely on annotation or assembly, and (3) robust statistical models to correct ambiguous, variable, and unidentifiable assemblies, with specific application to the most popular computational RNA-seq software, Cufflinks.
The first aim extends our batch discovery and removal methods to RNA-sequencing data by modeling the within-gene and spatial dependence in expression estimates, which otherwise leads to heavily biased artifact estimates and reduced power.
The second aim develops a statistical framework for first identifying regions of differential expression at base-pair resolution, then associating these regions with known genomic landmarks or annotation, yielding a lightweight and accurate scanning approach. This approach builds on the most mature statistical methods for RNA-seq analysis but does not rely on annotation to define transcriptional units such as genes or exons, allowing for unbiased discovery of differential expression.
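The two-step idea in the second aim (scan every base first, consult annotation only afterwards) can be illustrated with a minimal sketch. This is not the method proposed here; it is a simplified stand-in that assumes a per-base Welch t-statistic with a hypothetical stabilizing `offset`, a fixed `cutoff`, and toy coverage vectors. All function names, parameters, and feature names are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def base_statistics(group_a, group_b, offset=1.0):
    """Per-base Welch-style t-statistic between two groups of samples.

    Each group is a list of per-sample coverage vectors of equal length.
    `offset` is a hypothetical constant added to the standard error so that
    bases with near-zero variance do not produce extreme statistics.
    """
    stats = []
    for pos in range(len(group_a[0])):
        a = [sample[pos] for sample in group_a]
        b = [sample[pos] for sample in group_b]
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        stats.append((mean(a) - mean(b)) / (se + offset))
    return stats


def call_regions(stats, cutoff):
    """Merge consecutive bases with |statistic| > cutoff into half-open (start, end) regions."""
    regions, start = [], None
    for pos, t in enumerate(stats):
        if abs(t) > cutoff and start is None:
            start = pos                      # a candidate region opens here
        elif abs(t) <= cutoff and start is not None:
            regions.append((start, pos))     # region closes before this base
            start = None
    if start is not None:                    # region runs to the end of the vector
        regions.append((start, len(stats)))
    return regions


def annotate(regions, features):
    """Attach the names of overlapping features to each region.

    `features` maps a name to a half-open (start, end) interval. Annotation is
    used only here, after the regions have been called.
    """
    annotated = []
    for r_start, r_end in regions:
        hits = [name for name, (f_start, f_end) in features.items()
                if r_start < f_end and f_start < r_end]
        annotated.append(((r_start, r_end), hits or ["unannotated"]))
    return annotated


# Toy data: three samples per group, ten bases; group B has elevated coverage at bases 3-5.
group_a = [[10, 11, 9, 10, 10, 11, 10, 9, 10, 10],
           [9, 10, 10, 11, 9, 10, 10, 10, 11, 9],
           [11, 9, 10, 10, 11, 9, 10, 11, 9, 10]]
group_b = [[10, 10, 9, 30, 31, 29, 10, 10, 9, 10],
           [9, 11, 10, 29, 30, 31, 10, 9, 10, 11],
           [11, 9, 10, 31, 29, 30, 11, 10, 10, 9]]
features = {"geneX_exon1": (2, 5), "geneY_exon1": (8, 10)}  # hypothetical annotation

regions = call_regions(base_statistics(group_a, group_b), cutoff=5.0)
print(annotate(regions, features))
```

Because the regions are defined from the data alone, a differentially expressed stretch that falls outside every known feature would still be reported, labeled `unannotated` in this sketch.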
The third aim develops a statistical normalization and analysis framework that addresses the most egregious artifacts and limitations of the inherently ambiguous transcript assembly process. We will work closely with the developers of the most popular RNA-seq assembly software, Cufflinks, to integrate our developments into that software suite. By modeling variation across genes using functional regression and variation in the transcript assembly process using hierarchical models, we will reduce the number of false positives and increase the reproducibility of alternative transcript differential expression results. The statistical methods we develop will be packaged in freely available open-source software that is designed to interact with downstream Bioconductor packages for summarization and visualization such as IRanges or Genominator. The result of this proposal will be a modular, integrated pipeline for analyzing RNA-seq data, from raw reads produced by the sequencing machine to easily summarized and visualized tables of robust, interpretable, and reproducible results, thereby increasing the number and range of applications of RNA-seq in molecular biology and medicine.

Public Health Relevance

Genome-wide gene expression measurements are widely used to understand the molecular basis for diseases and to develop predictive and prognostic biomarkers. RNA-sequencing is a new technology for making expression measurements that is more flexible but produces larger and more complex data. We propose to develop statistical methods and software for analyzing these data, accounting for biological and technological errors.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Bender, Michael T
Johns Hopkins University
Biostatistics & Other Math Sci
Schools of Public Health
United States
Pertea, Mihaela; Kim, Daehwan; Pertea, Geo M et al. (2016) Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc 11:1650-67
Jaffe, Andrew E; Shin, Jooheon; Collado-Torres, Leonardo et al. (2015) Developmental regulation of human cortex transcription and its clinical relevance at single base resolution. Nat Neurosci 18:154-61
Jaffe, Andrew E; Hyde, Thomas; Kleinman, Joel et al. (2015) Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis. BMC Bioinformatics 16:372
Pertea, Mihaela; Pertea, Geo M; Antonescu, Corina M et al. (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33:290-5
Frazee, Alyssa C; Pertea, Geo; Jaffe, Andrew E et al. (2015) Ballgown bridges the gap between transcriptome assembly and expression analysis. Nat Biotechnol 33:243-6
Frazee, Alyssa C; Jaffe, Andrew E; Langmead, Ben et al. (2015) Polyester: simulating RNA-seq datasets with differential transcript expression. Bioinformatics 31:2778-84
Patil, Prasad; Bachant-Winner, Pierre-Olivier; Haibe-Kains, Benjamin et al. (2015) Test set bias affects reproducibility of gene signatures. Bioinformatics 31:2318-23
Leek, Jeffrey T (2014) svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res 42:
Parker, Hilary S; Leek, Jeffrey T; Favorov, Alexander V et al. (2014) Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction. Bioinformatics 30:2757-63
Parker, Hilary S; Corrada Bravo, Héctor; Leek, Jeffrey T (2014) Removing batch effects for prediction problems with frozen surrogate variable analysis. PeerJ 2:e561