Our MERIT award work will continue to have two main components: involvement in specific biomedical research projects, such as the NHLBI's FEHGAS study, and development of new statistical methods appropriate for the analysis of large, complex data sets. These efforts are complementary, with the specific projects suggesting which statistical methods are most needed and also serving as test cases for new methodology. The FEHGAS study, for example, seeks to predict age of onset of hypertension from SNP data (and background variables such as age and gender). There are 550,000 SNPs available for prediction, most of which will turn out to be useless, making the problem an order of magnitude more challenging than in expression microarray situations. Efron plans to extend the empirical Bayes methodology from his recent paper to this context, hopefully overcoming the difficulties caused by the usually weak predictive power of individual SNPs. Olshen plans to extend CART (Classification and Regression Trees) and bootstrap methodology to the selection of groups of promising predictive SNPs.

Large-scale significance testing, for instance selecting 'significant' genes in a microarray cancer study, has become an area of intense statistical development. Nevertheless, crucial questions of appropriate implementation remain vague in the literature: the choice of an appropriate null hypothesis; the selection of a comparison set (should all 550,000 SNPs be tested together, or separately by chromosome?); and the effects of correlation. We have made some headway in answering these questions, as described in the Progress Report. Our continuing efforts are a combination of methodological implementation and theoretical development.

Correlation can have particularly drastic effects on standard statistical techniques. In "Are a set of microarrays independent of each other?" it is shown that a study involving 20,000 genes has its effective sample size reduced to about 17 because of severe gene-wise correlation. We are currently developing diagnostic methods to spot correlation difficulties in massive data sets and to assess their effects on hypothesis tests, estimates, and predictions. A 20,000-gene microarray study produces 200,000,000 correlations, which sounds oppressively large for practical insight, but we are making progress on an empirical Bayes approximation that summarizes correlation effects in a single number, suitable for simple analysis.

Twentieth-century biostatistical applications were overwhelmingly frequentist in nature. Pure frequentism, though, becomes impractical for analyzing the large, complex data sets produced by modern biomedical devices, where the relationships of thousands of parameters and millions of data points have to be considered together. We are continuing to develop empirical Bayes methods that allow Bayesian ideas to be brought to bear on questions of multiple inference, without requiring specific prior distributions from the scientist. A long-term project is to understand how quickly empirical Bayes information accrues in a medical study. A False Discovery Rate is an estimate of the Bayes posterior probability that a gene (or a SNP, or a voxel) is 'null', given the observed data. How many subjects and how many genes do we need to observe in order to get an accurate empirical Bayes estimate of the posterior probability?

In our own version of Moore's law, biomedical data sets have increased an order of magnitude in size every few years since the 1990s.
Emerging technologies (tiling arrays, bead arrays, aptamer chips, methylation arrays, exon chips, and a variety of new imaging devices) promise further increases, taxing both computational equipment and statistical methodology. Our long-term MERIT goal is to provide algorithms and theory appropriate to massive-data biomedical requirements.
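The abstract's statement that a False Discovery Rate estimates the Bayes posterior probability of being null can be made concrete with the two-groups model: if a fraction pi0 of cases are null with density f0(z), and the observed z-values have mixture density f(z), then the local false discovery rate is fdr(z) = pi0 * f0(z) / f(z). The following minimal Python sketch illustrates that calculation; the function name local_fdr is ours, and a kernel density estimate stands in for the smoother spline-based density fits used in the published empirical Bayes methodology.

import numpy as np
from scipy import stats

def local_fdr(z, pi0=1.0):
    """Two-groups local false discovery rate: fdr(z) = pi0 * f0(z) / f(z).

    f0 is the theoretical N(0,1) null density; f is the mixture density of
    all observed z-values, estimated here with a Gaussian kernel (a simple
    stand-in for the spline fits used in the empirical Bayes literature).
    pi0 = 1 is a conservative upper bound on the null proportion.
    """
    z = np.asarray(z, dtype=float)
    f = stats.gaussian_kde(z)(z)        # estimated mixture density f(z)
    f0 = stats.norm.pdf(z)              # theoretical null density f0(z)
    return np.clip(pi0 * f0 / f, 0.0, 1.0)

# Toy example: 10,000 null cases plus 500 'non-null' cases centered at z = 3.
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0.0, 1.0, 10000), rng.normal(3.0, 1.0, 500)])
fdr = local_fdr(z)
print("cases flagged at fdr < 0.2:", int(np.sum(fdr < 0.2)))

Under the strong gene-wise correlation discussed above, the variability of such estimates can be far larger than independence calculations would suggest, which is precisely the accuracy question the correlation work targets.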

Agency
National Institutes of Health (NIH)
Institute
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
Type
Method to Extend Research in Time (MERIT) Award (R37)
Project #
5R37EB002784-39
Application #
8613317
Study Section
Special Emphasis Panel (NSS)
Program Officer
Peng, Grace
Project Start
1993-01-15
Project End
2015-01-31
Budget Start
2014-02-01
Budget End
2015-01-31
Support Year
39
Fiscal Year
2014
Total Cost
$373,909
Indirect Cost
$138,718
Name
Stanford University
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
009214214
City
Stanford
State
CA
Country
United States
Zip Code
94305
Efron, Bradley (2014) Two modeling strategies for empirical Bayes estimation. Stat Sci 29:285-301
Efron, Bradley (2014) Estimation and accuracy after model selection. J Am Stat Assoc 109:991-1007
Yoon, Sangho; Assimes, Themistocles L; Quertermous, Thomas et al. (2014) Insulin resistance: regression and clustering. PLoS One 9:e94129
Won, Joong-Ho; Lim, Johan; Kim, Seung-Jean et al. (2013) Condition number regularized covariance estimation. J R Stat Soc Series B Stat Methodol 75:427-450
Won, Joong-Ho; Goldberger, Ofir; Shen-Orr, Shai S et al. (2012) Significance analysis of xMap cytokine bead arrays. Proc Natl Acad Sci U S A 109:2848-53
Olshen, Adam B; Bengtsson, Henrik; Neuvial, Pierre et al. (2011) Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics 27:2038-46
Chen, Hao; Xing, Haipeng; Zhang, Nancy R (2011) Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Comput Biol 7:e1001060
Won, Joong-Ho; Ehret, Georg; Chakravarti, Aravinda et al. (2011) SNPs and other features as they predispose to complex disease: genome-wide predictive analysis of a quantitative phenotype for hypertension. PLoS One 6:e27891
Efron, Bradley (2010) Correlated z-values and the accuracy of large-scale statistical estimates. J Am Stat Assoc 105:1042-1055
Olshen, Richard A; Rajaratnam, Bala (2010) Successive normalization of rectangular arrays. Ann Stat 38:1638-1664

Showing the most recent 10 out of 12 publications