Clustering is a crucial component of exploratory data analysis. This project aims to develop novel algorithms that address various shortcomings of k-means, the most widely used clustering algorithm. Specific objectives include (a) the development of initialization methods to address the sensitivity of k-means to the initial cluster centers; (b) the investigation of alternative distance measures to address the sensitivity of k-means to outliers; and (c) the development of practical acceleration methods. Innovations developed during this project should be readily applicable to a wide range of clustering algorithms. For example, initialization is crucial for most clustering algorithms. Furthermore, an effective initialization method can be used independently of k-means as a standalone clustering algorithm. K-means is often used as a subroutine in other learning algorithms. Therefore, development of acceleration methods for k-means is of great practical interest.
The project is expected to make broader impacts on several fronts. At the international level, we intend to make significant contributions to the data mining literature by publishing in top-ranked journals. At the national level, we aim to enhance the competitiveness of the US by seeding the next generation of scientists. At the regional level, we hope to improve the quality of education in an EPSCoR state and contribute to the development of a diverse and skilled workforce. At the institutional level, we intend to improve the research environment and the curriculum of our Computer Science program. Finally, at the individual level, we hope to increase the participation of students from underrepresented groups in research and equip them with valuable skills including self-confidence, independent thinking/problem solving, and effective communication.
This project involves the enhancement of k-means, a well-known non-hierarchical data clustering algorithm. Clustering, the unsupervised classification of patterns into groups, is one of the most important tasks in exploratory data analysis. Primary goals of clustering include gaining insight into, classifying, and compressing data (Jain et al., "Data Clustering: A Review,"' ACM Computing Surveys 31(3): 264-323, 1999.) Clustering has a long and rich history that spans a variety of scientific disciplines including anthropology, biology, medicine, psychology, statistics, mathematics, engineering, and computer science. As a result, numerous clustering algorithms have been proposed since theearly 1950s. Among these algorithms, non-hierarchical (partitional) ones have found many applications, especially in engineering and computer science. K-means has been recently selected as the most influential clustering algorithm used in computer science related fields ("The Top Ten Algorithms in Data Mining," Editors: X. Wu and V. Kumar, Chapman and Hall/CRC, 2009.) Therefore, any improvement to this algorithm is likely to benefit a large number of academics and industry professionals, who routinely use this algorithm for exploratory data analyses. The most serious problem associated with k-means is its sensitivity to initialization. In this project, we have developed several approaches to address this problem and improved the state-of-the-art in this field. Another problem with k-means is its computational requirements. Even though k-means has linear time complexity, for large and/or high-dimensional data sets, e.g., web document data sets, the algorithm can be computationally demanding. We have developed a simple and highly practical method to accelerate k-means, which should make the algorithm more applicable to such data sets. This project has already made a significant impact in the research and education environment of the Louisiana State University in Shreveport, a primarily undergraduate, non-doctoral institution that serves significant numbers of students from underrepresented groups. The research training and education supported by this project will contribute to the development of a skilled workforce in the state of Louisiana.