Medical and biological data often come in the form of sampled curves and images. For example, gene expression arrays are now a widespread technology producing images of the activity of a significant part of a whole genome in a sample of individuals. Many other genomic assays are now emerging, including high-throughput sequencing ("RNA-seq") for measuring RNA abundance. Similarly, electromagnetic brain imaging techniques (MRI, fMRI and EEG) are widely used to study anatomy and cortical activity in the brain. A common feature of such data is that the individual case is high-dimensional, with the number of variables, genes, voxels, or sampling times being large. Often the number of measurements is much larger than the number of cases, and there are usually correlations among the components; both raise major challenges for statistical analysis. The broad aim of this ongoing three-investigator grant is to develop new statistical techniques and modify existing ones to enhance the analysis and interpretation of these data. A common thread in our new projects is the development of models and methods to extract maximal information from these emerging technologies, and to guide the scientist in interpreting the results. The renewal will address these goals through four Specific Aims. The investigators will study: 1) the significance analysis of RNA-seq comparative experiments using Poisson log-linear models and a novel procedure to estimate the false discovery rate, since accurate and robust methods for detecting differentially expressed genes are essential for effective use of RNA-seq in scientific research; 2) the estimation of cortical signals from EEG data using ℓ1 regularization techniques, developing fast, practical algorithms that offer hope of estimating source activity at a spatial and temporal resolution not seen before; 3) power and sample size calculations for multivariate tests, in particular using recent advances in the statistical application of random matrix theory to develop and evaluate power approximations, make them available in software, and promote more widespread evaluation and use of multivariate methods; and 4) the estimation of the false discovery rate for subset regression algorithms applied to modern genomic datasets, for which a sequential method is proposed that steps through a path of regression solutions. This work will help physical and medical scientists build effective and interpretable predictive models from large-scale datasets. We will implement our statistical tools in publicly available software, following a pattern established in earlier cycles of this grant, in which our packages have found wide use among medical researchers both at Stanford and around the world.

Public Health Relevance

Statistical methods such as those to be developed in this project are essential tools to help medical researchers discover and validate new basic science results (for example, in imaging and genomics) that can lead to new therapies. They also aid in the design and analysis of clinical investigations of new treatments, so that the large amount of data collected in current research is used as efficiently as possible while the degree of uncertainty in the conclusions is accurately described.

National Institutes of Health (NIH)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Peng, Grace
Stanford University
Schools of Medicine
United States
Viladomat, Júlia; Mazumder, Rahul; McInturff, Alex et al. (2014) Assessing the significance of global and local correlations under spatial autocorrelation: a nonparametric approach. Biometrics 70:409-18
Sen, Nandini; Mukherjee, Gourab; Sen, Adrish et al. (2014) Single-cell mass cytometry analysis of human tonsil T cell remodeling by varicella zoster virus. Cell Rep 8:633-45
Dharmawansa, Prathapasinghe; Johnstone, Iain M (2014) Joint density of eigenvalues in spiked multivariate models. Stat 3:240-249
Nowak, Gen; Hastie, Trevor; Pollack, Jonathan R et al. (2011) A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics 12:776-91
Witten, Daniela M; Tibshirani, Robert (2011) Penalized classification using Fisher's linear discriminant. J R Stat Soc Series B Stat Methodol 73:753-772
Witten, Daniela M; Tibshirani, Robert (2010) A framework for feature selection in clustering. J Am Stat Assoc 105:713-726
Johnstone, Iain M (2010) High dimensional Bernstein-von Mises: simple examples. Inst Math Stat Collect 6:87-98
Friedman, Jerome; Hastie, Trevor; Tibshirani, Rob (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw 33:1-22
Johnstone, Iain M; Titterington, D Michael (2009) Statistical challenges of high-dimensional data. Philos Transact A Math Phys Eng Sci 367:4237-53
