This project aims to develop a new class of model selection strategies, known as fence methods. The general idea is to isolate a subgroup of so-called correct models, of which the optimal model is a member. This is accomplished by constructing a statistical fence, or barrier, that carefully eliminates incorrect models. Once the fence is constructed, the optimal model is selected from among the correct models (those within the fence) according to a criterion such as model simplicity. This last step, the selection of the optimal model within the fence, can be made flexible enough to take scientific or economic considerations into account. The PIs have developed this concept in the context of mixed model selection, which includes, among other things, linear mixed models and generalized linear mixed models with clustered or non-clustered data. This project aims to:
1) Develop new fence methodology for the problem of gene set analysis from gene expression (microarray) studies. These gene sets represent a priori groupings of genes whose activity is thought to be related (often via biological pathways). It is therefore of interest to know whether these groups are perturbed with respect to changing conditions such as worsening of disease (in our case, worsening of colon cancer). Such knowledge would provide insight into which pathways appear to be implicated in poor versus better outcomes, thereby providing potentially novel biological targets for diagnostics or therapeutics. Fence methods for gene set analysis provide a potentially rich class of approaches for tackling such a task. Aim 1 will develop in detail the theory and optimality of such approaches and then provide comprehensive comparisons to existing methods. The newly developed methods will then be applied to a large repository of colon cancer microarray data representing the various stages of the disease. Working closely with a biological collaborator, implicated pathways found by the fence will be validated and unravelled biologically.
2) Develop new fence methodology for analyzing large scale health survey data, with the problem of small area estimation in mind. Here, fence methods will be developed along two tracks. The first allows a richer class of non-parametric small area estimation mixed models to be used, in which the degree of smoothing for the fixed effects part of the model is assessed by appropriate fence approaches. The second develops a fence approach for choosing among competing small area models based on the prediction quality of the small area random effects. In both situations, theory for the fence methods will be developed, and the area of application will be a large health care survey collected at NIH.
3) Extend fence methods. Extensions will include new computational approaches known as grating, as well as new ways of implementing the fence for association studies, with applications to large case-control SNP association studies. Again, detailed theory will be developed and applications undertaken with appropriate collaborators.
4) Develop freely available software to implement the fence methods developed in this project. This software will be written in the statistical package R, which will allow users to integrate it with other software continually being developed around the world.
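The core selection step described in this summary (build the barrier, then pick the simplest model inside it) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the PIs' implementation: the names `Candidate` and `fence_select` are hypothetical, the lack-of-fit values are made up, and a single fixed `cutoff` stands in for whatever data-driven barrier an actual fence method would compute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate model summarized by two numbers (hypothetical structure)."""
    name: str
    dim: int            # number of parameters, used here as the measure of simplicity
    lack_of_fit: float  # e.g. residual sum of squares or negative log-likelihood


def fence_select(candidates, cutoff):
    """Return (chosen, inside) for a simplified fence.

    A model is inside the fence when its lack of fit exceeds that of the
    best-fitting candidate by no more than `cutoff` (an assumed constant).
    Among the models inside the fence, the one with the smallest dimension
    is chosen; ties are broken by smaller lack of fit.
    """
    if not candidates:
        raise ValueError("at least one candidate model is required")
    best_fit = min(c.lack_of_fit for c in candidates)
    inside = [c for c in candidates if c.lack_of_fit - best_fit <= cutoff]
    chosen = min(inside, key=lambda c: (c.dim, c.lack_of_fit))
    return chosen, inside


# Illustrative (made-up) lack-of-fit values for four nested regression models.
candidates = [
    Candidate("intercept only", 1, 120.0),
    Candidate("x1", 2, 52.0),
    Candidate("x1 + x2", 3, 50.5),
    Candidate("x1 + x2 + x3", 4, 50.0),
]
chosen, inside = fence_select(candidates, cutoff=3.0)
print("inside the fence:", [c.name for c in inside])
print("selected model:", chosen.name)
```

In this toy setting the intercept-only model falls outside the fence because it fits far worse than the best candidate, and the simplest remaining model is selected. Replacing the `key` in the final `min` call is one way the last step could reflect other scientific or economic considerations.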
Correlated data are widely collected across the medical sciences, from imaging data to longitudinal clinical trial data to family-based genetic data, all in an effort to better understand the underlying determinants of disease. Mixed models have provided a rich framework for modeling such data and for making the best use of the various kinds of structure that are naturally present. However, selecting from a set of competing mixed models has proven to be a much more elusive problem, with little guidance available in the literature. The PIs of this proposal, building on their recent successes in the area, offer a new and elegant way to tackle this problem for complex data. They will rigorously study their proposed methods statistically, as well as through a variety of applications carried out in collaboration with prominent laboratories at their home institutions and elsewhere. These applications include gene set analysis from gene expression (microarray) studies, association analysis from high throughput SNP studies, and small area estimation from large health survey data.
Jiang, Jiming; Nguyen, Thuan; Rao, J Sunil (2015) The E-MS Algorithm: Model Selection with Incomplete Data. J Am Stat Assoc 110:1136-1147
Nguyen, Thuan; Peng, Jie; Jiang, Jiming (2014) Fence Methods for Backcross Experiments. J Stat Comput Simul 84:644-662
Lin, Bingqing; Pang, Zhen; Jiang, Jiming (2013) Fixed and Random Effects Selection by REML and Pathwise Coordinate Optimization. J Comput Graph Stat 22:341-355
Dazard, Jean-Eudes; Rao, J Sunil (2012) Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data. Comput Stat Data Anal 56:2317-2333
Dazard, Jean-Eudes; Xu, Hua; Rao, J Sunil (2011) R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization. Proc Am Stat Assoc 2011:3849-3863
Nguyen, Thuan; Jiang, Jiming (2011) Simple Estimation of Hidden Correlation in Repeated Measures. Stat Med 30:3403-3415
Dazard, Jean-Eudes; Rao, J Sunil (2010) Local Sparse Bump Hunting. J Comput Graph Stat 19:900-929
Dazard, Jean-Eudes; Rao, J Sunil (2010) Regularized Variance Estimation and Variance Stabilization of High Dimensional Data. Proc Am Stat Assoc 2010:5295-5309