Studies carried out at the genome-wide level now play a central role in modern biology and medicine. There continues to be a substantial need for new statistical methods that can be applied in these studies, particularly as study designs become more ambitious, sample sizes increase, and new technologies emerge. The overall goal of the proposed research is to develop statistical methods and software useful in understanding high-throughput molecular profiling data centered around characterizing genome-wide gene expression. We propose to develop statistical models, methods, and software that allow one to rigorously characterize variation of gene expression in terms of both study design and latent variables. Our proposed research is particularly focused on the most modern form of gene expression profiling, RNA-Seq, as well as the most ambitious and biologically fruitful problems currently being studied. We will develop rigorous, flexible, and robust models of variation in high-throughput data that encompass: (i) latent sources of systematic variation and more general sources of dependence among features, (ii) a principled dissection of sources of variation in next-generation RNA-Seq data and new methods for emerging RNA-Seq data, (iii) rigorous multiple hypothesis testing of associations between genomic features and latent variables, (iv) simultaneous inference of complex, yet commonly sought-after statistical hypotheses, and (v) dissemination to the greater research community through user-friendly and platform-independent software packages.

Public Health Relevance

Measuring genome-wide gene expression variation has been a revolutionary tool in biomedicine over the past decade. This is a data-centric endeavor that requires new and sophisticated statistical methods in order to arrive at sound biological conclusions. The proposed work will make novel contributions to statistical methods and software that will be applied to genome-wide gene expression studies in humans and model organisms, for both microarray data and next-generation sequencing data.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Struewing, Jeffery P
Princeton University
Organized Research Units
United States
Hackett, Sean R; Zanotelli, Vito R T; Xu, Wenxin et al. (2016) Systems-level analysis of mechanisms regulating yeast metabolic flux. Science 354:
Ochoa, Alejandro; Storey, John D; Llinás, Manuel et al. (2015) Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol 11:e1004509
Chung, Neo Christopher; Storey, John D (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics 31:545-54
Robinson, David G; Wang, Jean Y; Storey, John D (2015) A nested parallel experiment demonstrates differences in intensity-dependence between RNA-seq and microarrays. Nucleic Acids Res 43:e131
Marstrand, Troels T; Storey, John D (2014) Identifying and mapping cell-type-specific chromatin programming of gene expression. Proc Natl Acad Sci U S A 111:E645-54
Robinson, David G; Storey, John D (2014) subSeq: determining appropriate sequencing depth through efficient read subsampling. Bioinformatics 30:3424-6
Robinson, David G; Chen, Wei; Storey, John D et al. (2014) Design and analysis of Bar-seq experiments. G3 (Bethesda) 4:11-8
Kim, Jinhee; Ghasemzadeh, Nima; Eapen, Danny J et al. (2014) Gene expression profiles associated with acute myocardial infarction and risk of cardiovascular death. Genome Med 6:40
Jaffe, Andrew E; Storey, John D; Ji, Hongkai et al. (2013) Gene set bagging for estimating the probability a statistically significant result will replicate. BMC Bioinformatics 14:360
Leek, Jeffrey T; Johnson, W Evan; Parker, Hilary S et al. (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28:882-3

Showing the most recent 10 out of 25 publications