RNA sequencing (RNA-seq) is a powerful new technology for mapping and quantifying transcriptomes using next-generation, ultra-high-throughput sequencing. Although extremely promising, RNA-seq poses daunting challenges for researchers analyzing its data: the volume of data is massive, the biases are substantial, and short-read alignment is uncertain. Most current analytic programs count the total number of tags that land within each exon and use the normalized counts as the expression measure. Such methods ignore variation and correlation in sequencing depth within an exon, which may result in less accurate expression measures. Because the correlation between the read counts of adjacent bases depends on the distance between them, it is referred to as spatial correlation. Large base-specific variation and spatial correlation between bases make naive approaches, such as averaging, ineffective for normalizing RNA-seq data and quantifying gene/isoform expression. The presence of location-specific variation together with spatial correlation is a defining characteristic of many spatial data sets in Geostatistics, Spatial Epidemiology, and image processing, and it has been studied in the Spatial Statistics literature. In this project, the investigators propose to apply and extend the ideas, models, and methodologies rooted in Spatial Statistics to model and analyze RNA-seq data. In particular, the investigators develop spatial Poisson mixed effects models, including a hierarchical model and a mixture model, to accommodate the biases, variations, and correlations present in RNA-seq data, so as to estimate gene/isoform expression levels accurately and to facilitate the comparison of gene/isoform expression and the discovery of novel transcript structures or activities. Furthermore, the investigators will apply the proposed methods to analyze real RNA-seq data generated from prostate cancer and psoriasis transcriptomic studies.
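
To make the contrast above concrete, the following minimal Python sketch computes an exon-level normalized count in the style of RPKM (reads per kilobase of exon per million mapped reads) and, separately, the correlation between per-base read depths at a given distance. It is an illustration only and is not the investigators' method or software: the function names, the per-base depth values, and the read totals are all hypothetical, and the lag correlation is an ordinary Pearson correlation used here as a simple stand-in for the notion of spatial correlation.

```python
from math import sqrt


def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Exon-level measure: reads per kilobase of exon per million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def lag_correlation(depth, lag):
    """Pearson correlation between the depth at base i and at base i + lag."""
    if lag < 1 or lag >= len(depth) - 1:
        raise ValueError("lag must be at least 1 and leave two or more pairs")
    x = depth[:-lag]
    y = depth[lag:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-base read depth along one short exon (made-up values).
depth = [2, 3, 5, 8, 12, 15, 14, 10, 6, 4, 3, 2, 4, 7, 11, 13, 12, 9, 5, 3]

if __name__ == "__main__":
    # A single exon-level number hides the within-exon variation in depth.
    print("RPKM:", rpkm(read_count=500, exon_length_bp=2000,
                        total_mapped_reads=10_000_000))
    # The correlation between bases changes with the distance between them.
    for lag in (1, 2, 5):
        print("lag", lag, "correlation:", lag_correlation(depth, lag))
```

In this toy series the depth at neighboring bases moves together, while bases five positions apart tend to move in opposite directions, which is the kind of distance-dependent structure that a single per-exon count cannot represent.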

Monitoring gene expression levels genome-wide is important for understanding the mechanisms of many biological processes. Over the past decade, the microarray has been the main laboratory tool for measuring gene expression levels. Recently, RNA-seq, an emerging technology, has been shown to offer key advantages over the microarray in measuring gene expression profiles. However, existing methods for quantifying expression levels from RNA-seq data are crude and unsatisfactory, which greatly compromises the power of RNA-seq for genomic and transcriptomic studies. In this project, having carefully investigated the unique characteristics of RNA-seq data, the investigators propose a series of advanced statistical models and aim to develop effective and efficient methods for RNA-seq data analysis. The methods generated from this project will greatly benefit the fast-growing community of researchers who plan to conduct RNA-seq experiments and analyze the resulting data. Furthermore, this project constitutes a significant contribution to the advancement of statistical methodology. The investigators will also develop and support open-source computer software for RNA-seq data analysis based on the methods resulting from this project and make it freely available to the public online.

Project Report

In this project, we developed advanced statistical models to characterize base-level RNA-seq count data. Using these models and algorithms (POME, PM-seq), we are able to achieve more accurate and less biased quantitative measures of gene or transcript expression levels from RNA-seq technologies. Our project made significant contributions to the fields of Biostatistics and Bioinformatics. Specifically, the POME and PM-seq approaches normalize RNA-seq data and quantify gene expression levels differently than existing methods do. As a result, our methods lead to more consistent and accurate measurements of gene expression levels and are found to outperform most existing methods, such as RPKM. We also developed novel Bayesian statistical methods that use the large amount of microarray expression and RNA-seq data stored in public databases to design informative priors for inference problems such as detecting differentially expressed genes. RNA-seq is poised to replace the microarray as the method of choice for transcriptomic research. The methods developed in this project are expected to appeal to the fast-growing community of researchers who plan to conduct RNA-seq experiments. We will develop and support open-source computer software that is freely available online, and we will provide graduate-level training by advising students in their research. The results from this project will therefore contribute to research and scientific discovery in a variety of areas, including biology and medical research.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1000617
Program Officer: Mary Ann Horn
Project Start:
Project End:
Budget Start: 2010-10-01
Budget End: 2014-09-30
Support Year:
Fiscal Year: 2010
Total Cost: $368,567
Indirect Cost:
Name: Emory University
Department:
Type:
DUNS #:
City: Atlanta
State: GA
Country: United States
Zip Code: 30322