Many application domains, such as intelligence, counter-terrorism, forensics, and disease control, often need to cross-match multiple very large datasets, such as watch lists. Because those datasets may contain privacy-sensitive or confidential information, efficient privacy-preserving protocols for cross-matching them are crucial. The problem of privacy-preserving record matching has been addressed through Secure Multi-party Computation (SMC) protocols. Under these protocols, the computation over the data is converted into a series of functions with private inputs. However, a major drawback of SMC-based protocols is that they rely heavily on cryptographic primitives, such as homomorphic encryption, that do not scale to the size of practical problems. As a result, SMC-based protocols cannot be applied directly to resource-constrained, data-intensive privacy-preserving record matching. This project develops a novel approach based on the observation that, to apply SMC to practical applications, one needs to bridge the gap between the size of the datasets that can be matched efficiently using SMC protocols and the size of the datasets seen in practice. The project tackles this gap from a novel angle by developing techniques that reduce the size of practical problems through privacy-preserving data sanitization methods. The project thus solves the privacy-preserving data matching problem through the following steps. First, to protect the privacy of data subjects, useful statistics about the data are gathered using differential privacy. Second, the differentially private statistics are shared among the parties involved in data matching. These parties then identify candidate record pairs for which matching is likely to be fruitful; this step is referred to as data blocking. Finally, SMC techniques are applied to these candidates to accurately cross-match information.
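The three steps above can be illustrated with a minimal Python sketch. It is not the project's actual protocol: the record layout, the `decade` bucketing function, the threshold rule, and all function names are hypothetical, and the final SMC comparison is left out entirely. The sketch assumes each party publishes a histogram of bucket counts perturbed with Laplace noise of scale 1/epsilon, which is the standard mechanism for counting queries where adding or removing one record changes a single count by 1.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def decade(record):
    # Hypothetical bucketing: records are (name, age) tuples, grouped by decade.
    return record[1] // 10 * 10


def noisy_histogram(records, bucket_of, buckets, epsilon, rng):
    # Step 1: per-bucket counts with Laplace noise. Adding or removing one
    # record changes one count by 1 (sensitivity 1), so the scale is 1/epsilon.
    counts = Counter(bucket_of(r) for r in records)
    return {b: counts.get(b, 0) + laplace_noise(1.0 / epsilon, rng)
            for b in buckets}


def candidate_blocks(hist_a, hist_b, threshold):
    # Step 2 (blocking): keep buckets where BOTH shared noisy counts exceed
    # the threshold. The threshold rule is an assumption of this sketch.
    return sorted(b for b in hist_a
                  if hist_a[b] > threshold
                  and hist_b.get(b, float("-inf")) > threshold)


def candidate_pairs(records_a, records_b, bucket_of, blocks):
    # Step 3 input: only pairs that fall in the same surviving bucket would
    # be handed to the expensive SMC comparison, which is not shown here.
    keep = set(blocks)
    by_bucket = {}
    for rb in records_b:
        b = bucket_of(rb)
        if b in keep:
            by_bucket.setdefault(b, []).append(rb)
    return [(ra, rb)
            for ra in records_a
            if bucket_of(ra) in keep
            for rb in by_bucket[bucket_of(ra)]]
```

In use, each party would call `noisy_histogram` on its own records with its own random source, exchange only the resulting dictionaries, and run `candidate_blocks` on the pair. The saving comes from the last function: the number of pairs passed to SMC depends on the size of the surviving buckets rather than on the product of the two dataset sizes.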
In addition to syntactic matching, the project supports semantic matching, in which records are compared according to semantic similarity functions. The semantic matching protocols include techniques for matching and aligning ontologies, because the use of ontologies is crucial for effective semantic matching. This project is the first to use differential privacy for efficient privacy-preserving record matching that also leverages a semantics-based approach and a privacy-preserving approach to ontology alignment. The techniques developed in the project are the first to achieve efficient privacy-preserving matching of large-scale datasets using differential privacy, thus overcoming the scalability problems of conventional SMC techniques. The approach developed in this project expands the opportunities and contexts for data use by enabling the cross-matching of multiple data archives, possibly owned by different parties, without violating the privacy of the data. Many applications of interest to our society will benefit from such opportunities. For further information, see the project web site at the URL: www.cs.purdue.edu/homes/bertino/prirelink

Project Report

The project has focused on the problem of using data while assuring the privacy of the individuals to whom the data refers. This is a pressing problem today, given the amount of data collected by many different parties, such as internet providers, search engines, governmental agencies, mobile devices, and sensors. It is important to note that such data can have many different uses that benefit our society. One example is medical data, which is invaluable for the progress of medical research. Therefore, a crucial issue is how to use the data in a meaningful way while at the same time assuring its privacy. The goal of this project is to allow parties that need the data for their tasks (such as counter-terrorism and medical research) to use the data and extract useful knowledge from it without violating the privacy of honest citizens. The project has designed and analyzed several methods that make this possible. Unlike previous work, which is unable to handle large datasets, the methods developed in this project are efficient and carefully combine different techniques so as to be practically applicable. Another important application area for the methods developed in this project is the cloud. Our methods make it possible to store data on a cloud and to protect its privacy from unauthorized access while it resides there.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1016722
Program Officer: Jeremy Epstein
Project Start:
Project End:
Budget Start: 2010-08-01
Budget End: 2014-07-31
Support Year:
Fiscal Year: 2010
Total Cost: $240,000
Indirect Cost:
Name: Purdue University
Department:
Type:
DUNS #:
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907