Our MERIT award work will continue to have two main components: involvement in specific biomedical research projects such as NHLBI's FEHGAS study, and development of new statistical methods appropriate for the analysis of large, complex data sets. These efforts are complementary, with the specific projects suggesting which statistical methods are most needed, and also serving as test cases for new methodology. The FEHGAS study, for example, seeks to predict age of onset of hypertension from SNP data (and background variables such as age and gender). There are 550,000 SNPs available for prediction, most of which will turn out to be useless, making the problem an order of magnitude more challenging than in expression microarray situations. Efron plans to extend the empirical Bayes methodology from his recent paper to this context, hopefully overcoming the difficulties caused by the usually weak predictive power of individual SNPs. Olshen plans to extend CART (Classification and Regression Trees) and bootstrap methodology to the selection of groups of promising predictive SNPs. Large-scale significance testing, for instance selecting 'significant' genes in a microarray cancer study, has become an area of intense statistical development. Nevertheless, crucial questions of appropriate implementation remain vague in the literature: the choice of an appropriate null hypothesis; the selection of a comparison set (Should all 550,000 SNPs be tested together or separately by chromosome?); and the effects of correlation. We have made some headway in answering these questions, as described in the Progress Report. Our continuing efforts are a combination of methodological implementation and theoretical development. Correlation can have particularly drastic effects on standard statistical techniques. In "Are a set of microarrays independent of each other?"
it is shown that a study involving 20,000 genes has its effective sample size reduced to about 17 because of severe gene-wise correlation. We are currently developing diagnostic methods to spot correlation difficulties in massive data sets, and to assess their effects on hypothesis tests, estimates, and predictions. A 20,000-gene microarray study produces roughly 200,000,000 pairwise correlations, which sounds oppressively large for practical insight. But we are making progress on an empirical Bayes approximation that summarizes correlation effects in a single number, suitable for simple analysis. Twentieth Century biostatistical applications were overwhelmingly frequentist in nature. Pure frequentism, though, becomes impractical for analyzing the large, complex data sets produced by modern biomedical devices, where the relationships of thousands of parameters and millions of data points have to be considered together. We are continuing to develop empirical Bayes methods that allow Bayesian ideas to be brought to bear on questions of multiple inference, without requiring specific prior distributions from the scientist. A long-term project is to understand how quickly empirical Bayes information accrues in a medical study. A False Discovery Rate is an estimate of the Bayes posterior probability that a gene (or a SNP, or a voxel) is 'null', given the observed data. How many subjects and how many genes do we need to observe in order to get an accurate empirical Bayes estimate of the posterior probability? In our own version of Moore's law, biomedical data sets have increased an order of magnitude in size every few years since the 1990s. Emerging technologies (tiling arrays, bead arrays, aptamer chips, methylation arrays, exon chips, and a variety of new imaging devices) promise further increases, taxing both computational equipment and statistical methodology. Our long-term MERIT goal is to provide algorithms and theory appropriate to massive-data biomedical requirements.
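As a rough illustration of two quantities mentioned above, the following Python sketch counts the distinct pairwise correlations among 20,000 genes and computes a simple tail-area False Discovery Rate estimate from a set of z-values. It is a minimal sketch under stated assumptions, not the methodology proposed in this application: the null distribution is taken to be the theoretical N(0,1), the null proportion `pi0` defaults to the conservative value 1, and all function names and the simulated data are hypothetical.

```python
import math
import random


def n_pairwise_correlations(n_genes: int) -> int:
    """Number of distinct gene pairs, i.e. n choose 2."""
    return n_genes * (n_genes - 1) // 2


def two_sided_null_tail(z: float) -> float:
    """P(|Z| >= |z|) for Z ~ N(0,1) (assumed theoretical null)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def tail_fdr(z_values, z, pi0=1.0):
    """Tail-area Fdr estimate: pi0 * (null tail prob) / (observed tail proportion).

    pi0 = 1.0 is a conservative assumption about the proportion of null cases.
    """
    n = len(z_values)
    observed = sum(1 for v in z_values if abs(v) >= abs(z)) / n
    if observed == 0.0:
        return 1.0  # no observed cases this extreme; report the uninformative value
    return min(1.0, pi0 * two_sided_null_tail(z) / observed)


def simulate_z_values(n_null=9500, n_nonnull=500, shift=3.0, seed=0):
    """Hypothetical data: mostly null N(0,1) z-values plus a shifted minority."""
    rng = random.Random(seed)
    nulls = [rng.gauss(0.0, 1.0) for _ in range(n_null)]
    nonnulls = [rng.gauss(shift, 1.0) for _ in range(n_nonnull)]
    return nulls + nonnulls


if __name__ == "__main__":
    print(n_pairwise_correlations(20_000))  # 199990000, i.e. about 2 x 10^8
    z_values = simulate_z_values()
    print(tail_fdr(z_values, 3.0))
```

The estimate treats the cases as independent; the correlation effects discussed above are exactly what such a simple calculation ignores.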
Efron, Bradley (2014) Two modeling strategies for empirical Bayes estimation. Stat Sci 29:285-301
Efron, Bradley (2014) Estimation and Accuracy after Model Selection. J Am Stat Assoc 109:991-1007
Yoon, Sangho; Assimes, Themistocles L; Quertermous, Thomas et al. (2014) Insulin resistance: regression and clustering. PLoS One 9:e94129
Won, Joong-Ho; Lim, Johan; Kim, Seung-Jean et al. (2013) Condition Number Regularized Covariance Estimation. J R Stat Soc Series B Stat Methodol 75:427-450
Won, Joong-Ho; Goldberger, Ofir; Shen-Orr, Shai S et al. (2012) Significance analysis of xMap cytokine bead arrays. Proc Natl Acad Sci U S A 109:2848-53
Olshen, Adam B; Bengtsson, Henrik; Neuvial, Pierre et al. (2011) Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics 27:2038-46
Chen, Hao; Xing, Haipeng; Zhang, Nancy R (2011) Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Comput Biol 7:e1001060
Won, Joong-Ho; Ehret, Georg; Chakravarti, Aravinda et al. (2011) SNPs and other features as they predispose to complex disease: genome-wide predictive analysis of a quantitative phenotype for hypertension. PLoS One 6:e27891
Efron, Bradley (2010) Correlated z-values and the accuracy of large-scale statistical estimates. J Am Stat Assoc 105:1042-1055
Olshen, Richard A; Rajaratnam, Bala (2010) Successive Normalization of Rectangular Arrays. Ann Stat 38:1638-1664