SI2-SSE: Scalable Big Data Clustering by Random Projection Hashing

Wilsey, Philip

Abstract

This project plans to develop a distributed algorithm for secure clustering of high-dimensional data sets. Fields in health and biology benefit significantly from scalable data clustering. Bioinformatics problems such as microarray clustering, protein-protein interaction clustering, medical resource decision making, medical image processing, and clustering of epidemiological events all stand to benefit from the ability to process larger data sets. The algorithm under development, called Random Projection Hash or RPHash, uses aspects of locality-sensitive hashing (LSH) and multi-probe random projection to achieve computational scalability and linear achievable gains from parallel speedup. Furthermore, RPHash provides data anonymization through destructive manipulation of the data, preventing de-anonymization attacks beyond what standard best-practice database security methods offer. RPHash will be deployable on commercially available cloud resources running the Hadoop (MRv2) implementation of MapReduce. Exploiting general-purpose cloud processing solutions allows researchers to scale their processing needs using virtually limitless commercial processing resources.

The RPHash algorithm uses various recent techniques in data mining along with a new approach toward achieving algorithmic scalability on distributed systems. The basic intuition of RPHash is to combine multi-probe random projection with discrete space quantization. Regions of high density are then regarded as centroid candidates. To follow common parameterized k-means methods, the top k regions will be selected. The focus on a randomized, and thus non-deterministic, clustering algorithm is somewhat uncommon in computing, but common for ill-posed, combinatorially restrictive problems such as clustering and partitioning. Despite theoretical results showing that k-means has an exponential worst-case complexity, many real-world problems tend to fare much better under k-means and other similar algorithms.
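
The intuition described above (project, quantize, count, keep the top k dense regions) can be illustrated with a small single-machine sketch. This is not the project's implementation: it uses one Gaussian random projection instead of multi-probe projection, a plain regular grid as the quantizer, and it averages the original vectors in each bucket to form a centroid, all of which are assumptions made for illustration. The names `make_projection`, `bucket_id`, `dense_region_centroids`, and the parameters `proj_dim` and `cell_width` are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_projection(dim, proj_dim, rng):
    """Build a proj_dim x dim Gaussian random projection matrix (illustrative)."""
    scale = 1.0 / math.sqrt(proj_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim)]
            for _ in range(proj_dim)]


def bucket_id(matrix, point, cell_width):
    """Project a point, then quantize it onto a regular grid.

    The tuple of integer cell indices acts as the hash of the point.
    """
    projected = (sum(w * x for w, x in zip(row, point)) for row in matrix)
    return tuple(math.floor(v / cell_width) for v in projected)


def dense_region_centroids(points, k, proj_dim=2, cell_width=1.0, seed=0):
    """Return centroids of the k most populated buckets, densest first."""
    rng = random.Random(seed)
    dim = len(points[0])
    matrix = make_projection(dim, proj_dim, rng)

    counts = defaultdict(int)  # bucket -> number of points hashed to it
    sums = {}                  # bucket -> running coordinate sums
    for p in points:
        b = bucket_id(matrix, p, cell_width)
        counts[b] += 1
        acc = sums.setdefault(b, [0.0] * dim)
        for i, x in enumerate(p):
            acc[i] += x

    # Assumption: a bucket's centroid is the mean of the original vectors in it.
    top = sorted(counts, key=counts.get, reverse=True)[:k]
    return [[s / counts[b] for s in sums[b]] for b in top]


if __name__ == "__main__":
    data_rng = random.Random(1)
    centers = [[0.0] * 10, [50.0] * 10, [-50.0] * 10]
    data = [[c + data_rng.gauss(0.0, 0.01) for c in center]
            for center in centers for _ in range(100)]
    for centroid in dense_region_centroids(data, k=3, cell_width=5.0):
        print([round(v, 2) for v in centroid])
```

Because the projection is random, results depend on the seed: a cluster can straddle a grid boundary and split across buckets, or two clusters can land in the same cell. Since the per-bucket counts and sums are simple additive aggregates, they could in principle be computed per data partition and merged afterward, which is the kind of structure a MapReduce job exploits.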

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1440420
Program Officer: Bogdan Mihaila

Project Start
Project End
Budget Start: 2014-09-01
Budget End: 2019-08-31
Support Year
Fiscal Year: 2014
Total Cost: $498,127
Indirect Cost

Institution: University of Cincinnati, Cincinnati, OH, United States
