Subgroup discovery based on high-dimensional genomic data can potentially provide novel insights into a disease process. Typically this has been done with various forms of cluster analysis (both supervised and unsupervised). Extreme subgroups are defined as those which are homogeneous in nature but which present extreme-valued outcomes. Of particular interest in this project is the development of methodology to identify such subgroups that are extreme with respect to survival outcomes (e.g. those individuals who do unusually well on a cancer treatment and can be delineated based on high-dimensional genomic predictors). If such subgroups are real and are uncovered, implications would include improved understanding of the disease etiology, discovery of new biomarkers with potential therapeutic targets, and the possibility of early and personalized therapeutic interventions. Statistically, this problem can be framed within a sparse survival bump hunting framework. We have brought together a team of biostatisticians who have pioneered the first sparse bump hunting models for continuous responses, as well as two internationally recognized laboratories as collaborators, who work on multi-platform genomic profiling for pediatric medulloblastoma and non-small cell lung cancer, respectively. We thus propose the following specific aims: 1) To develop new models for sparse bump hunting that allow survival outcomes with both continuous and nominal predictors (e.g. gene expression and SNPs); 2) To develop a sparse survival bump hunting approach that will allow us to integrate SNP and gene expression profile data by three different approaches: sparse coaching, bump phenotyping, and sparse mediation analysis; 3) To develop detailed theory for the asymptotic performance of these sparse survival bump hunting models and theory for a new fence-based methodology for studying model validation, and to empirically study and compare performance in detailed simulations as well as on the datasets provided by our collaborator laboratories; 4) To develop a Java-based user-friendly interface and a command-line end-user CRAN package in the R language that will implement all of our methodologies and their extensions.
One of the questions of interest is how to uncover hidden subgroups of individuals with differential survival (say, in response to treatment) and to characterize the genomic determinants that define these groups. In this work, we will develop a new methodology designed to find extreme subgroups of patients within a population. Specific to this research are methods for focusing the search on the genes that relate most strongly to extreme survival and for integrating various kinds of genomic profiles using three different strategies, which are meant to improve subgroup finding and to yield more biological insight.