A New Paradigm for Classification Based on Dissimilarity Information Via Regularized Kernel Estimation
Grace Wahba, PI
The objective of this research is to develop improved methods for classification and clustering when attribute vectors for the objects of interest are either unknown or of much higher dimension than is useful, but dissimilarity information between pairs of objects is available. In the proposed work, this dissimilarity information may be subjective, crude, noisy, incomplete, confined to a nonlinear manifold, drawn from multiple sources, and/or inconsistent. The approach builds on preliminary work by the PI and collaborators, who have initiated two new robust nonparametric methods for obtaining positive definite kernels (a.k.a. "reproducing kernels") from noisy dissimilarity data under various circumstances. These kernels generate "pseudo-attribute" vectors, which may be used for clustering, for outlier detection, or for classification in a support vector machine with either copiously labeled data or sparsely labeled data ("semi-supervised learning"). Tasks are proposed to build a series of optimized classification systems, under a variety of scientifically important scenarios regarding the nature of the available data, that combine robustly estimated kernels with support vector machines to classify objects based on dissimilarity information. It is proposed to develop theoretically valid and practically useful optimization procedures and efficient algorithms for these systems, test them in carefully designed test beds where the answer is known, apply them to a variety of classification tasks, compare the results with those of related systems, and publicize the results.
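To make the idea of turning dissimilarities into a kernel and then into "pseudo-attribute" vectors concrete, here is a minimal sketch. It is not the robust estimation method proposed here: as a stand-in it uses classical multidimensional scaling (double centering of squared dissimilarities, followed by clipping negative eigenvalues), which is an assumption made purely for illustration. The function names `kernel_from_dissimilarities` and `pseudo_attributes` are hypothetical.

```python
import numpy as np


def kernel_from_dissimilarities(d):
    """Stand-in (classical MDS, NOT the proposed robust estimator):
    map a symmetric dissimilarity matrix to a positive semidefinite kernel."""
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    # Double-center the squared dissimilarities.
    centering = np.eye(n) - np.ones((n, n)) / n
    k = -0.5 * centering @ (d ** 2) @ centering
    k = (k + k.T) / 2  # enforce exact symmetry
    # Noisy or inconsistent dissimilarities can yield negative eigenvalues;
    # clipping them at zero gives a positive semidefinite matrix.
    eigvals, eigvecs = np.linalg.eigh(k)
    eigvals = np.clip(eigvals, 0.0, None)
    return (eigvecs * eigvals) @ eigvecs.T


def pseudo_attributes(k, dim):
    """Coordinates from the top `dim` eigenpairs of the kernel matrix k."""
    eigvals, eigvecs = np.linalg.eigh(k)
    order = np.argsort(eigvals)[::-1][:dim]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))


# Toy data (assumed): three objects at positions 0, 1, 3 on a line,
# observed only through their pairwise distances.
d = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
k = kernel_from_dissimilarities(d)
x = pseudo_attributes(k, dim=1)
# The kernel induces squared distances k[i, i] + k[j, j] - 2 * k[i, j].
induced_sq = np.diag(k)[:, None] + np.diag(k)[None, :] - 2 * k
```

In this noise-free toy case the kernel-induced distances match the input dissimilarities. The resulting pseudo-attribute vectors could then be passed to any standard classifier or clustering routine, such as a support vector machine.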
With the availability of extremely large amounts of data and high-speed computing, modern classification tools are achieving impressive results in speech recognition, text classification, image analysis, and the classification of proteins and microarray data, among other applications. However, there is still much room for improvement in certain areas. This work will provide a novel contribution to the theory and practice of classification when the available data may be subjective, crude, noisy, or incomplete, may satisfy complex constraints, may come from multiple sources, and may be inconsistent. It is anticipated that the proposed work will provide improved methods of statistical analysis with the potential to substantially impact essentially any engineering or scientific endeavor that collects data to be classified.