Cancer is a complex genetic disease that results from the accumulation of multiple genetic defects, including mutations and epigenetic changes. Advances in microarray techniques make it possible to profile gene expression in human tissues on a genome-wide scale, enabling the discovery of genomic biomarkers with predictive power for cancer diagnosis and prognosis. Such discoveries can lead to a better understanding of cancer genetics, more accurate prediction of tumor behavior, and more rational treatment selection. Effective biomarker selection is the key step connecting wet-lab studies with pharmacogenetic practice. Our long-term goal is to provide more effective and reliable biomarker selection methods, make more efficient use of high-dimensional gene expression data, and eventually facilitate clinical practice using genomic measurements. In the present application, we will develop novel clustering penalized methods for biomarker selection in cancer studies with gene expression data. The proposed methods explicitly take into account the clustered nature of gene expression. They can identify a few important gene clusters, and a few important genes within those selected clusters, that influence cancer outcomes such as cancer status, response to treatment, and survival. They are expected to provide more accurate gene selection and better prediction than existing methods.
The specific aims are as follows. [1] Propose novel clustering penalized methods for biomarker selection at both the cluster level and the within-cluster gene level. We will propose (a) the Supervised Adaptive Group Lasso (SAGLasso) and (b) the Group Bridge Lasso (GBL). We will investigate computational algorithms, tuning parameter selection, evaluation of gene selection and prediction, and large-sample statistical properties. [2] Conduct cancer classification analysis using the proposed clustering penalized approaches, where the outcome of interest is categorical cancer status or response to therapy. [3] Conduct cancer survival analysis using the proposed clustering penalized approaches, where the outcome is censored survival time. [4] Conduct extensive numerical studies using a variety of cancer gene expression data sets. The approaches developed in Aims 1-3 will be used to analyze data from ongoing studies as well as publicly available cancer microarray data. We will compare the gene selection results and prediction performance of the proposed approaches with those of existing methods. The proposed study will be the first to establish a rigorous statistical framework that explicitly accounts for the clustered nature of gene expression in cancer biomarker selection. The proposed methods are expected to outperform existing ones in both gene selection and prediction performance. We will also investigate cancer classification and survival models in detail and develop efficient algorithms and portable R/S-Plus packages, making the proposed methods easily accessible for routine biomedical data analysis.
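To make cluster-level selection concrete, the following is a minimal sketch of the group soft-thresholding step used by the standard group lasso. It is an illustration only, not the proposed SAGLasso or GBL. The function name, the representation of clusters as index lists, and the toy numbers are all assumptions introduced here. The step shrinks the coefficients of each gene cluster jointly and sets an entire cluster to zero when its Euclidean norm falls below the penalty level.

```python
import math


def group_soft_threshold(beta, groups, lam):
    """Apply the group lasso proximal step to a coefficient vector.

    beta   : list of floats, one coefficient per gene (hypothetical input)
    groups : list of index lists, one per gene cluster (hypothetical layout)
    lam    : non-negative penalty level applied to every cluster
    """
    shrunk = list(beta)
    for idx in groups:
        # Euclidean norm of the coefficients in this cluster
        norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        # Clusters with norm <= lam are removed entirely;
        # the rest are shrunk toward zero by a common factor.
        scale = 0.0 if norm <= lam else 1.0 - lam / norm
        for i in idx:
            shrunk[i] = scale * beta[i]
    return shrunk


# Toy example: the first cluster has a strong signal, the second a weak one.
coefs = group_soft_threshold([3.0, 4.0, 0.1, 0.1], [[0, 1], [2, 3]], 1.0)
print(coefs)
```

In this toy example the first cluster is kept and shrunk, while the second is dropped as a whole. This operator alone selects at the cluster level only; selecting individual genes within the retained clusters, as Aim 1 describes, requires an additional within-cluster penalty that this sketch does not show.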