We propose to provide clustering software for very large databases and for categorical data. Investigators in virtually all areas of research seek to discover patterns and relationships in data. Computer-intensive exploratory analysis, or data mining, is having a huge impact in science and industry (e.g. Berkhin 2002, Maitra 2002). However, the availability of software for obtaining partitions and visualizing them lags far behind both the proliferation of proposed methods and the growth in size of available databases. We believe that implementing new algorithms for clustering large datasets that may include non-numeric attributes, and for visualizing cluster properties, will open new opportunities for data analysis.

In Phase I, we developed scalable implementations of clustering methods, including k-means and its extensions to categorical and mixed-mode data, and demonstrated that combining clustering with visualization reveals features of the data that neither technique reveals alone. Our ultimate goal in Phases II and III is to develop a modular addition to the S-PLUS language, called S+CLUSTER, that provides the following key features:

- A suite of clustering algorithms suitable for large and possibly high-dimensional datasets that may include categorical attributes;
- Extensive capabilities for visual exploration of clustering results; and
- Tools for validation and diagnostics that facilitate objective assessment of clustering results.

We intend to create software that is flexible and easy to use, and that enables the analysis and understanding of data from a wide range of applications. Clustering, or unsupervised classification, has been used in genetics research, protein classification, psychiatric research, analysis of biomedical signals, and segmentation of medical images, among other areas. The software will be part of an integrated environment for data analysis and will permit customization of the clustering process, extending the ability of biomedical researchers to understand complex data. New insights into microarrays, epidemiological data, and protein databases may have high potential in drug discovery, disease diagnosis, and treatment.
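
The abstract does not say which extension of k-means to categorical data was implemented. As an illustration only, the sketch below shows one well-known extension of this kind, k-modes, which replaces cluster means with per-attribute modes and Euclidean distance with a count of mismatched attributes. It is written in Python rather than S-PLUS, and the function names, the seeding rule (first k distinct records), and the toy records are all assumptions made for this example, not part of S+CLUSTER.

```python
from collections import Counter


def hamming(a, b):
    """Count the attribute positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Return the most frequent category in each column of `rows`."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, k, max_iter=20):
    """Cluster tuples of categorical values; returns (labels, modes)."""
    # Assumed seeding rule: the first k distinct records become the modes.
    # This is simple and deterministic, but sensitive to record order.
    modes = list(dict.fromkeys(records))[:k]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each record to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes cannot change
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy data: (color, size, shape) records.
records = [
    ("red", "small", "round"),
    ("blue", "large", "square"),
    ("red", "small", "oval"),
    ("blue", "large", "oval"),
    ("red", "medium", "round"),
    ("blue", "medium", "square"),
]
labels, modes = k_modes(records, k=2)
```

Each pass costs time proportional to the number of records times k times the number of attributes, which is the same scaling that makes k-means attractive for large datasets. A practical implementation would add a more robust seeding strategy and multiple restarts.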

Agency
National Institutes of Health (NIH)
Institute
National Center for Research Resources (NCRR)
Type
Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #
2R44RR016386-02
Application #
7107367
Study Section
Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer
Arora, Krishan
Project Start
2001-07-01
Project End
2008-09-29
Budget Start
2006-09-30
Budget End
2007-09-29
Support Year
2
Fiscal Year
2006
Total Cost
$359,770
Indirect Cost
Name
Insightful Corporation
Department
Type
DUNS #
150683779
City
Seattle
State
WA
Country
United States
Zip Code
98109