The investigator studies the problem of finding groups in data ("clustering"). Most existing clustering methods make implicit or explicit assumptions about the shapes of the groups, for example that the groups are roughly spherical or Gaussian, and they will fail if these assumptions are violated. The goal of the project is to develop (i) nonparametric clustering methods capable of finding groups of arbitrary shape; (ii) methods for assessing the statistical validity of the clusters; (iii) tools for visualizing the results. Clustering is cast as a statistical rather than a purely algorithmic problem. The observed data are regarded as a sample from some underlying population, and the goal is to estimate a well defined target characteristic of the population - the cluster tree - from the sample. Adopting a statistical view of clustering has two benefits: (i) it allows comparing the performance of different methods; (ii) it gives meaning to the notion of "cluster validity". Without a sampling model and a target characteristic of the population the question whether clusters are valid or spurious is meaningless. The proposed method for estimating the population cluster tree is based on analyzing a graph over the sample but, unlike most other graph-based clustering methods, it is motivated by the underlying statistical estimation problem. Assessing cluster validity - determining the number of distinct groups, with "one" as a possible answer - has proven a vexing problem, especially in the absence of prior knowledge or assumptions about group shapes. The investigator proposes a novel approach to this problem based on resampling.

Finding groups in data is a problem that occurs in many areas, from genomics (identifying groups of genes with similar function based on gene expression levels measured by DNA microarrays) to information retrieval (spotting topics in document collections) to marketing (determining distinct groups of customers with similar characteristics). Clustering is an exploratory tool, and there typically is little or no prior information about the shapes or the number of groups. It is therefore important to have methods that automatically determine the number of groups and do not rely on assumptions about their shape, and visualization tools that help in understanding the shapes of the groups, their arrangement in feature space, and the influence of parameters of the clustering method on the results.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
0505824
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2005-06-15
Budget End
2009-05-31
Support Year
Fiscal Year
2005
Total Cost
$168,986
Indirect Cost
Name
University of Washington
Department
Type
DUNS #
City
Seattle
State
WA
Country
United States
Zip Code
98195