Medical and biological data often come in the form of sampled curves and images. For example, gene expression arrays are a now-widespread technology producing images of the activity of a significant part of a whole genome in a sample of individuals. Many other genomic assays are now emerging, including high-throughput sequencing ("RNA-seq") for measuring RNA abundance. Similarly, electromagnetic brain imaging techniques (MRI, fMRI and EEG) are widely used to study brain anatomy and cortical activity. A common feature of such data is that the individual case is high-dimensional, with the number of variables, genes, voxels, or sampling times being large. Often the number of measurements is much larger than the number of cases, and there are usually correlations among the components; both raise major challenges for statistical analysis. The broad aim of this ongoing three-investigator grant is to develop new statistical techniques and modify existing ones to enhance the analysis and interpretation of these data. A common thread in our new projects is the development of models and methods to extract maximal information from these emerging technologies, and to guide the scientist in interpreting the results. The renewal will address these goals through four Specific Aims. The investigators will study: 1) the significance analysis of RNA-seq comparative experiments using Poisson log-linear models and a novel procedure to estimate the false discovery rate.
Accurate and robust methods for detecting differentially expressed genes are essential for effective use of RNA-seq for scientific research; 2) the estimation of cortical signals from EEG data using ℓ1 regularization techniques, and the development of fast, practical algorithms that offer hope of estimating source activity at a spatial and temporal resolution not seen before; 3) power and sample size calculations for multivariate tests, in particular using recent advances in the statistical application of random matrix theory to develop and evaluate power approximations, make them available in software, and promote more widespread evaluation and use of multivariate methods; and 4) the estimation of the false discovery rate for subset regression algorithms applied to modern genomic datasets. A sequential method is proposed that steps through a path of regression solutions. This work will help physical and medical scientists to build effective and interpretable predictive models from large-scale datasets. We will implement our statistical tools in publicly available software, following a pattern established in earlier cycles of this grant, in which our packages have found wide use among medical researchers both at Stanford and around the world.

Public Health Relevance

Statistical methods such as those to be developed in this project are essential tools to help medical researchers discover and validate new basic science results (for example in imaging and genomics) that can lead to new therapies. They also aid in the design and analysis of clinical investigations of new treatments, so that the large amount of data collected in current research is used as efficiently as possible while the degree of uncertainty in the conclusions is accurately described.

National Institutes of Health (NIH)
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Peng, Grace
Stanford University
Schools of Medicine
United States
Taylor, Jonathan; Tibshirani, Robert (2018) Post-Selection Inference for ℓ1-Penalized Likelihood Models. Can J Stat 46:41-61
Donoho, David L; Gavish, Matan; Johnstone, Iain M (2018) Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model. Ann Stat 46:1742-1778
Pataki, Camille I; Rodrigues, João; Zhang, Lichao et al. (2018) Proteomic analysis of monolayer-integrated proteins on lipid droplets identifies amphipathic interfacial α-helical membrane anchors. Proc Natl Acad Sci U S A 115:E8172-E8180
Johnstone, Iain M (2018) Tail sums of Wishart and Gaussian eigenvalues beyond the bulk edge. Aust N Z J Stat 60:65-74
Johnstone, Iain M; Paul, Debashis (2018) PCA in High Dimensions: An orientation. Proc IEEE Inst Electr Electron Eng 106:1277-1292
Reid, Stephen; Newman, Aaron M; Diehn, Maximilian et al. (2018) Genomic Feature Selection by Coverage Design Optimization. J Appl Stat 45:2658-2676
Powers, Scott; Qian, Junyang; Jung, Kenneth et al. (2018) Some methods for heterogeneous treatment effect estimation in high dimensions. Stat Med 37:1767-1787
Groll, Andreas; Hastie, Trevor; Tutz, Gerhard (2017) Selection of effects in Cox frailty models by regularization methods. Biometrics 73:846-856
Johnstone, I M; Nadler, B (2017) Roy's largest root test under rank-one alternatives. Biometrika 104:181-193
Reid, Stephen; Tibshirani, Robert (2016) Sparse regression and marginal testing using cluster prototypes. Biostatistics 17:364-376

Showing the most recent 10 out of 61 publications