The investigator studies models for the unsupervised clustering and hierarchical organization of objects based on similarity data. The research takes a three-pronged approach to scaling the analysis to large data sets. First, the investigator studies powerful latent variable models with relatively few parameters for analyzing similarity data. Second, the investigator develops dramatically faster algorithms for fitting the models. Specifically, the models are fit with combinatorial variants of the EM algorithm, which converge much faster than the conventional EM algorithm. Third, the latent variable structure of the models is extended hierarchically, leading to scalable algorithms that hierarchically cluster previously found clusters of objects. Investigations of this type of latent variable hierarchy lead to models and algorithms that scale to large data sets much better than traditional flat models. In addition, the similarity analysis extracts relationships between clusters and allows for targeted clustering based on a prior specification of the cluster relationships of interest.
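To make the idea of fitting a latent variable model to similarity data concrete, the following is a minimal sketch only. The abstract does not specify the model or the combinatorial EM variants, so this example substitutes a well-known stand-in: classification (hard-assignment) EM for a Gaussian block model, in which the similarity between objects i and j is assumed to have a mean that depends only on their cluster labels. The function name `hard_em_block_cluster`, its parameters, and the toy data are all hypothetical and are not taken from the project.

```python
import numpy as np

def hard_em_block_cluster(S, k, n_iter=50, init=None, seed=0):
    """Hard-assignment EM for an assumed Gaussian block model on a
    symmetric n x n similarity matrix S (diagonal entries included).
    Returns (labels, block_means). Illustrative stand-in only."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    # Start from the supplied labels, or from random labels.
    z = np.asarray(init).copy() if init is not None else rng.integers(k, size=n)
    for _ in range(n_iter):
        # M-step: mean similarity for each pair of clusters (a, b).
        Z = np.eye(k)[z]                      # n x k one-hot memberships
        sizes = Z.sum(axis=0)
        counts = np.outer(sizes, sizes)       # number of (i, j) pairs per block
        sums = Z.T @ S @ Z
        # Empty blocks fall back to the overall mean similarity.
        mu = np.where(counts > 0, sums / np.maximum(counts, 1), S.mean())
        # Classification step: cost[i, a] = sum_j (S[i, j] - mu[a, z_j])^2.
        M = mu[:, z]                          # k x n, M[a, j] = mu[a, z_j]
        cost = ((S[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        z_new = cost.argmin(axis=1)
        if np.array_equal(z_new, z):          # labels stable: stop
            break
        z = z_new
    return z, mu

# Toy data (made up): two groups of five objects, similarity 0.9 within
# a group and 0.1 between groups.
truth = np.array([0] * 5 + [1] * 5)
S = np.where(truth[:, None] == truth[None, :], 0.9, 0.1)

# Start from labels with one object from each group misplaced.
start = truth.copy()
start[4], start[9] = 1, 0
labels, block_means = hard_em_block_cluster(S, k=2, init=start)
print(labels)
print(block_means)
```

The returned block means give a small summary of how the clusters relate to one another, which is the kind of between-cluster relationship the abstract mentions. A hierarchical extension would apply the same procedure again to the clusters found at the previous level; that step is not shown here.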

In this modern, data-rich age, there is a pressing need for statistical models that can handle large data sets. This investigation focuses on a ubiquitous type of relational data called similarity data, which consists of similarity measurements between pairs of objects. Examples of data that fit into this framework include internet traffic between routers, web connectivity data used by search engines, and microarray gene expression data. There is great interest in finding internet traffic clusters and web topic clusters, as well as functional groupings of genes. The investigator studies models and algorithms for clustering and organizational analysis of relational data that can scale to large data sets. The analysis finds meaningful underlying cluster groups along with structural relationships between groups. The methodology the investigator develops is applicable across a wide range of disciplines.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0312275
Program Officer: Leland M. Jameson
Project Start:
Project End:
Budget Start: 2003-07-01
Budget End: 2008-06-30
Support Year:
Fiscal Year: 2003
Total Cost: $135,794
Indirect Cost:
Name: Rutgers University
Department:
Type:
DUNS #:
City: New Brunswick
State: NJ
Country: United States
Zip Code: 08901