Medical and biological data often come in the form of sampled curves and images. For example, gene expression arrays are a now-widespread technology that produces images of the activity of a significant part of a whole genome in a sample of individuals. Many other genomic assays are now emerging, including high-throughput sequencing ("RNA-seq") for measuring RNA abundance. Similarly, electromagnetic brain imaging techniques (MRI, fMRI and EEG) are widely used to study the anatomy of the brain and its cortical activity. A common feature of such data is that each individual case is high-dimensional, with a large number of variables, genes, voxels, or sampling times. Often the number of measurements is much larger than the number of cases, and the components are usually correlated; both features raise major challenges for statistical analysis.

The broad aim of this ongoing three-investigator grant is to develop new statistical techniques, and modify existing ones, to enhance the analysis and interpretation of these data. A common thread in our new projects is the development of models and methods to extract maximal information from these emerging technologies, and to guide the scientist in interpreting the results. The renewal will address these goals through four Specific Aims. The investigators will study: 1) significance analysis of RNA-seq comparative experiments, using Poisson log-linear models and a novel procedure to estimate the false discovery rate; accurate and robust methods for detecting differentially expressed genes are essential for effective use of RNA-seq in scientific research; 2) the estimation of cortical signals from EEG data using ℓ1 regularization techniques, developing fast, practical algorithms that offer hope of estimating source activity at a spatial and temporal resolution not seen before; 3) power and sample size calculations for multivariate tests, in particular using recent advances in the statistical application of random matrix theory to develop and evaluate power approximations, make them available in software, and promote more widespread evaluation and use of multivariate methods; and 4) the estimation of the false discovery rate for subset regression algorithms applied to modern genomic datasets. A sequential method is proposed that steps through a path of regression solutions. This work will help physical and medical scientists to build effective and interpretable predictive models from large-scale datasets. We will implement our statistical tools in publicly available software, following a pattern established in earlier cycles of this grant, in which our packages have found wide use among medical researchers both at Stanford and around the world.

Public Health Relevance

Statistical methods such as those to be developed in this project are essential tools to help medical researchers discover and validate new basic science results (for example, in imaging and genomics) that can lead to new therapies. They also aid in the design and analysis of clinical investigations of new treatments, so that the large amount of data collected in current research is used as efficiently as possible and the degree of uncertainty in the conclusions is accurately described.

Agency: National Institutes of Health (NIH)
Institute: National Institute of Biomedical Imaging and Bioengineering (NIBIB)
Type: Research Project (R01)
Project #: 5R01EB001988-17
Application #: 8300798
Study Section: Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer: Peng, Grace
Project Start: 1996-09-10
Project End: 2015-06-30
Budget Start: 2012-07-01
Budget End: 2013-06-30
Support Year: 17
Fiscal Year: 2012
Total Cost: $384,173
Indirect Cost: $129,191

Name: Stanford University
Department: Miscellaneous
Type: Schools of Medicine
DUNS #: 009214214
City: Stanford
State: CA
Country: United States
Zip Code: 94305

Goldstein-Piekarski, Andrea N; Korgaonkar, Mayuresh S; Green, Erin et al. (2016) Human amygdala engagement moderated by early life stress exposure is a biobehavioral target for predicting recovery on antidepressants. Proc Natl Acad Sci U S A 113:11955-11960
Hughey, Jacob J; Hastie, Trevor; Butte, Atul J (2016) ZeitZeiger: supervised learning for high-dimensional data from an oscillatory system. Nucleic Acids Res 44:e80
Mukherjee, Gourab; Johnstone, Iain M (2015) Exact minimax estimation of the predictive density in sparse Gaussian models. Ann Stat 43:937-961
Fithian, William; Elith, Jane; Hastie, Trevor et al. (2015) Bias correction in species distribution models: pooling survey and collection data for multiple species. Methods Ecol Evol 6:424-438
Gross, Samuel M; Tibshirani, Robert (2015) Collaborative regression. Biostatistics 16:326-338
Lee, Jason D; Hastie, Trevor J (2015) Learning the Structure of Mixed Graphical Models. J Comput Graph Stat 24:230-253
Lim, Michael; Hastie, Trevor (2015) Learning interactions via hierarchical group-lasso regularization. J Comput Graph Stat 24:627-654
Wager, Stefan; Hastie, Trevor; Efron, Bradley (2014) Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. J Mach Learn Res 15:1625-1651
Reid, Stephen; Tibshirani, Rob (2014) Regularization Paths for Conditional Logistic Regression: The clogitL1 Package. J Stat Softw 58:
Fithian, William; Hastie, Trevor (2014) Local case-control sampling: efficient subsampling in imbalanced data sets. Ann Stat 42:1693-1724

Showing the most recent 10 out of 45 publications