This project focuses on two new universally applicable methods for clustering. The first, ROVAC (Response Oriented Variable Clustering), clusters explanatory variables in a response-oriented fashion. The method uses a response model to generate the variable clustering via a novel application of bagging. The clustering does not rely on a distributional assumption for the explanatory variables. The method is flexible, allowing for the use of different model selection criteria, and it generalizes to a wide variety of response models. The PI is also investigating extensions of the Relative Data Depth (ReD), a recently developed tool for cluster visualization and validation. Building on the concept of depth relative to regression, the PI develops methods for selecting the number of clusters in a data set and for selecting the features that are related to a specific clustering.
This project is largely motivated by interdisciplinary research. The goal is to provide scientists in related fields with new and flexible clustering tools for analyzing high-dimensional data. Standard methods for clustering or grouping features require the definition of a similarity measure, which is often a non-trivial and highly subjective task. In this project the PI focuses on the development of two clustering techniques based on intuitively simple concepts. The first method uses knowledge of another measured quantity, a response, and groups together features that are similarly related to that response. The second method uses a concept of depth, a measure of how representative a feature is with respect to its group. Prototype algorithms are being applied to real data, including but not limited to gene expression data. Preliminary results are competitive with current leading methodologies.
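The idea behind the first method, grouping features that are similarly related to a response, can be illustrated with a minimal sketch. This is not the ROVAC algorithm, whose details are not given here. It assumes, purely for illustration, that the "response model" is a marginal Pearson correlation, that bagging is approximated by plain bootstrap resampling, and that features are grouped by a threshold on the distance between their bootstrap profiles. The function names, the synthetic data, and the threshold of 0.2 are all hypothetical.

```python
import random
from statistics import fmean


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_profiles(X, y, n_boot, rng):
    """For each feature column in X, collect its correlation with y
    over n_boot bootstrap resamples of the observations."""
    n = len(y)
    profiles = [[] for _ in X]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        yb = [y[i] for i in idx]
        for j, col in enumerate(X):
            profiles[j].append(pearson([col[i] for i in idx], yb))
    return profiles


def group_features(profiles, threshold):
    """Merge features whose profiles differ by less than `threshold`
    on average (connected components, i.e. single linkage)."""
    p = len(profiles)
    labels = list(range(p))
    for i in range(p):
        for j in range(i + 1, p):
            dist = fmean(abs(a - b) for a, b in zip(profiles[i], profiles[j]))
            if dist < threshold and labels[i] != labels[j]:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels


# Hypothetical synthetic data: features 0-2 track a latent factor that raises
# the response, features 3-5 track one that lowers it, features 6-8 are noise.
rng = random.Random(0)
n = 400
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
y = [a - b + 0.3 * rng.gauss(0, 1) for a, b in zip(z1, z2)]
X = (
    [[a + 0.2 * rng.gauss(0, 1) for a in z1] for _ in range(3)]
    + [[b + 0.2 * rng.gauss(0, 1) for b in z2] for _ in range(3)]
    + [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
)

profiles = bootstrap_profiles(X, y, n_boot=50, rng=rng)
labels = group_features(profiles, threshold=0.2)
print(labels)
```

In this toy setting the two signal groups receive distinct labels because their bootstrap correlations with the response sit far apart, while the noise features stay separate from both. No similarity measure among the features themselves is specified; the grouping comes only from each feature's relationship to the response, which is the intuition described above.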