This project plans to develop a distributed algorithm for secure clustering of high-dimensional data sets. Fields in health and biology benefit significantly from scalable data clustering. Bioinformatics problems such as microarray clustering, protein-protein interaction clustering, medical resource decision making, medical image processing, and clustering of epidemiological events all stand to benefit from larger data set sizes. The algorithm under development, called Random Projection Hash (RPHash), uses aspects of locality-sensitive hashing (LSH) and multi-probe random projection to achieve computational scalability and achievable linear speedup from parallel execution. Furthermore, RPHash provides data anonymization through destructive manipulation of the data, which protects against de-anonymization attacks beyond what standard best-practice database security methods provide. RPHash will be deployable on commercially available cloud resources running the Hadoop (MRv2) implementation of MapReduce. The use of general-purpose cloud processing solutions allows researchers to scale their processing needs using virtually limitless commercial processing resources.

The RPHash algorithm combines several recent techniques in data mining with a new approach to achieving algorithmic scalability on distributed systems. The basic intuition of RPHash is to combine multi-probe random projection with discrete space quantization. Regions of high density are then regarded as centroid candidates. Following common parameterized k-means methods, the top k regions are selected. The focus on a randomized, and thus non-deterministic, clustering algorithm is somewhat uncommon in computing, but common for ill-posed, combinatorially restrictive problems such as clustering and partitioning. Despite theoretical results showing that k-means has exponential worst-case complexity, many real-world problems tend to fare much better under k-means and similar algorithms.
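
The intuition above (project, quantize, count, keep the top k dense regions) can be illustrated with a minimal single-machine sketch. This is not the project's implementation: it assumes a single Gaussian random projection rather than multi-probe projection, a uniform grid as the discrete quantizer, and no distribution over MapReduce. All function names and the parameters `out_dim`, `cell_width`, and `seed` are hypothetical choices made for illustration.

```python
import math
import random
from collections import defaultdict


def random_projection_matrix(in_dim, out_dim, rng):
    """Gaussian random matrix (assumed here), scaled by 1/sqrt(out_dim)."""
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, point):
    """Map a point to the lower-dimensional space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, point)) for row in matrix]


def bucket_id(projected, cell_width):
    """Quantize onto a uniform grid; the tuple of cell indices is the hash."""
    return tuple(math.floor(v / cell_width) for v in projected)


def top_k_dense_regions(points, k, out_dim=3, cell_width=4.0, seed=0):
    """Return centroid candidates for the k most populated hash buckets."""
    rng = random.Random(seed)
    matrix = random_projection_matrix(len(points[0]), out_dim, rng)

    # Group the original points by the bucket their projection falls into.
    members = defaultdict(list)
    for p in points:
        members[bucket_id(project(matrix, p), cell_width)].append(p)

    # Dense regions are the buckets holding the most points.
    densest = sorted(members.values(), key=len, reverse=True)[:k]

    # Candidate centroid: coordinate-wise mean of a bucket's original points.
    return [[sum(col) / len(group) for col in zip(*group)] for group in densest]


if __name__ == "__main__":
    rng = random.Random(1)
    # Two synthetic clusters in 20 dimensions, centered at 0 and at 10.
    data = [[rng.gauss(c, 0.5) for _ in range(20)]
            for c in (0.0, 10.0) for _ in range(200)]
    for centroid in top_k_dense_regions(data, k=2):
        print([round(x, 1) for x in centroid[:5]])
```

Because the grid is fixed, a cluster that straddles a cell boundary is split across buckets; probing several projections or neighboring cells is one way to soften this, which the sketch leaves out.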

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1440420
Program Officer: Bogdan Mihaila
Budget Start: 2014-09-01
Budget End: 2019-08-31
Fiscal Year: 2014
Total Cost: $498,127

Name: University of Cincinnati
City: Cincinnati
State: OH
Country: United States
Zip Code: 45221