Studies carried out at the genome-wide level now play a central role in modern biology and medicine. There continues to be a substantial need for new statistical methods that can be applied in these studies, particularly as study designs become more ambitious, sample sizes increase, and new technologies emerge. The overall goal of the proposed research is to develop statistical methods and software for understanding high-throughput molecular profiling data, centered on characterizing genome-wide gene expression. We propose to develop statistical models, methods, and software that allow one to rigorously characterize variation in gene expression in terms of both study design and latent variables. Our proposed research focuses in particular on the most modern form of gene expression profiling, RNA-Seq, and on the most ambitious and biologically fruitful problems currently being studied. We will develop rigorous, flexible, and robust models of variation in high-throughput data that encompass: (i) latent sources of systematic variation and more general sources of dependence among features, (ii) a principled dissection of sources of variation in next-generation RNA-Seq data and new methods for emerging RNA-Seq data, (iii) rigorous multiple hypothesis testing of associations between genomic features and latent variables, (iv) simultaneous inference of complex, yet commonly sought-after, statistical hypotheses, and (v) dissemination to the greater research community through user-friendly and platform-independent software packages.
Measuring genome-wide gene expression variation has been a revolutionary tool in biomedicine over the past decade. This is a data-centric endeavor that requires new and sophisticated statistical methods to arrive at sound biological conclusions. The proposed work will make novel contributions to statistical methods and software that will be applied to genome-wide gene expression studies in humans and model organisms, for both microarray data and next-generation sequencing data.
|Chung, Neo Christopher; Storey, John D (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics 31:545-54|
|Ochoa, Alejandro; Storey, John D; Llinás, Manuel et al. (2015) Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol 11:e1004509|
|Robinson, David G; Chen, Wei; Storey, John D et al. (2014) Design and analysis of Bar-seq experiments. G3 (Bethesda) 4:11-8|
|Marstrand, Troels T; Storey, John D (2014) Identifying and mapping cell-type-specific chromatin programming of gene expression. Proc Natl Acad Sci U S A 111:E645-54|
|Kim, Jinhee; Ghasemzadeh, Nima; Eapen, Danny J et al. (2014) Gene expression profiles associated with acute myocardial infarction and risk of cardiovascular death. Genome Med 6:40|
|Robinson, David G; Storey, John D (2014) subSeq: determining appropriate sequencing depth through efficient read subsampling. Bioinformatics 30:3424-6|
|Jaffe, Andrew E; Storey, John D; Ji, Hongkai et al. (2013) Gene set bagging for estimating the probability a statistically significant result will replicate. BMC Bioinformatics 14:360|
|Leek, Jeffrey T; Johnson, W Evan; Parker, Hilary S et al. (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28:882-3|
|Woo, Sangsoon; Leek, Jeffrey T; Storey, John D (2011) A computationally efficient modular optimal discovery procedure. Bioinformatics 27:509-15|
|Gresham, David; Boer, Viktor M; Caudy, Amy et al. (2011) System-level analysis of genes and functions affecting survival during nutrient starvation in Saccharomyces cerevisiae. Genetics 187:299-317|
Showing the most recent 10 out of 22 publications