This project develops variable selection procedures for grouped predictors, where the grouping structure may be either known or unknown. Known grouping structures arise in Analysis of Variance (ANOVA) models, where multiple `dummy' variables represent a single factor, and in nonparametric regression using basis functions. In these situations, the goal is to include or exclude the entire set of variables as a group. Unknown grouping structure occurs when underlying clusters of predictors have a combined effect on the response, such as a set of genes sharing a common pathway. This research addresses variable selection under both known and unknown grouping structures, while simultaneously addressing additional goals specific to the problem at hand. The first component of the project combines `supervised clustering' and variable selection into a single step, to facilitate the identification of important predictive clusters when the underlying grouping structure is unknown. The second component, for known grouping structure, develops penalization techniques that perform grouped selection while also allowing hierarchical constraints to be enforced. The third component develops a technique that performs the typical post-hoc pairwise comparison analysis of ANOVA within the factor selection process. All three components are developed in a penalization framework through appropriate choices of the penalty.
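As a rough illustration of how the choice of penalty can include or exclude a whole group of variables at once, the sketch below applies block soft-thresholding, the proximal step associated with a group-lasso style penalty. This is an assumption made for illustration only: the abstract does not name the specific penalties the project uses, and the function name `group_soft_threshold`, the group labels, and the tuning value `lam` are all hypothetical.

```python
import numpy as np


def group_soft_threshold(beta, groups, lam):
    """Block soft-thresholding (hypothetical helper, not the project's method).

    Shrinks each group of coefficients toward zero by its Euclidean norm and
    sets the whole group to zero when that norm does not exceed lam.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    out = np.zeros_like(beta)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(beta[idx])
        if norm > lam:
            # Keep the group, shrinking all of its coefficients by one factor.
            out[idx] = (1.0 - lam / norm) * beta[idx]
        # Otherwise the entire group stays at zero (excluded together).
    return out


# Illustrative values only: two groups of two coefficients each.
coefs = np.array([3.0, 4.0, 0.3, -0.4])
labels = np.array([0, 0, 1, 1])
shrunk = group_soft_threshold(coefs, labels, lam=1.0)
```

In this toy example the first group has a large norm and is retained with shrunken coefficients, while the second group has a small norm and is dropped as a unit, which is the grouped include-or-exclude behaviour described above.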
With the abundance of information now available in all scientific fields, it can be an overwhelming task to decide which of the massive number of possible characteristics, or variables, are important. It is therefore essential to develop techniques for variable selection. Often there is also an underlying group structure that the scientist would like to discover. One common example occurs in gene expression studies, in which classifying patients into disease subtypes based on their gene expression profiles is a major focus. Among the thousands of genes in a gene expression study, only a small fraction are actually useful indicators of disease status, and many of the genes can be combined into functional groups. The investigator's research is particularly geared toward enabling these types of multi-faceted analyses, such as finding the relevant genes while also identifying the group structure. A general theme of the research is that appropriately designed statistical procedures can achieve multiple objectives simultaneously and in an integrated fashion. The importance of the variable selection problem across all disciplines, together with the investigator's collaborations with medical researchers and other scientists, allows the results to be readily disseminated into the applied research community, where they can be used to improve the quality of life.