In today's information-centric networked world, concerns about protecting identities and other private information are growing in importance. It is important to establish not only a legal baseline but also a technological baseline that protects such information. At the same time, data search and analysis technologies are emerging that are capable of processing extremely large volumes of information. As a result, research is needed into data analysis technologies that enhance privacy and protect private information in a computationally efficient manner. One important area of technology that supports data analysis is optimization, the body of procedures for making a system as efficient as possible. Solving optimization problems efficiently has been one of the major themes of computer science throughout the history of the field. Unfortunately, many optimization problems that need to be solved in practice are unlikely to have efficient algorithmic solutions. To cope with this difficulty, computer scientists have developed numerous practical approximation algorithms, along with general techniques for designing such algorithms. Often, the optimization problems that need to be solved arise from the analysis of real data with potential privacy restrictions. Importantly, there are no known general tools for designing approximation algorithms that are both efficient and provably private.

As an initial case study, the project studies community discovery in social network analysis. It is a natural candidate for private approximation for two reasons: first, the underlying social network data can in many cases be sensitive; second, community structure should not depend crucially on any single relation in the network, and it should therefore be possible to find a private community discovery algorithm with good utility. The intellectual merit of the project lies in the research required to develop general methods for designing efficient differentially private approximation algorithms for combinatorial optimization problems. The project's broader impacts include applications in real-world law enforcement and counterterrorism.

Project Report

Concerns about protecting identities and other private information continue to grow in today's information-centric networked world, even as data search and analysis technologies emerge that can process extremely large volumes of information. Data analysis technologies should therefore protect private information in a computationally efficient manner. Differential privacy has emerged as a useful mathematical definition of privacy, along with a variety of methods for provably providing differential privacy in different contexts, as well as surprising connections to other areas such as learning theory. Optimization, the body of procedures for making a system as efficient as possible, is an important technology supporting data analysis, and solving optimization problems efficiently has been one of the major themes of computer science throughout the history of the field. This project advanced the understanding and application of differential privacy, the state of the art in optimization methods, and the ability to achieve the two together.

Project research produced a theoretical framework for a differentially private mechanism that provides privacy and utility for a large class of combinatorial optimization problems, including set cover, shortest path, minimum vertex cover, maximum matching, minimum cut, and graph clustering. In particular, we considered differential privacy for combinatorial optimization problems whose input can be represented by a matrix of 0-1 entries for which a "neighbor" relationship based on changing a single entry makes sense. (In the case of adjacency matrices for graphs, this corresponds to the existing notion of "edge privacy.") We provide a novel solution for probabilistic differential privacy for such problems; for contrast, the standard baseline for private selection over a discrete solution space is sketched below. The solution can also make use of approximation algorithms to improve efficiency, with a potential degradation in utility but no loss of privacy. Unlike existing differential privacy solutions, our solution does not depend on the "global sensitivity" of the underlying problem. We have experimentally validated the efficiency and utility of our approach on a variety of combinatorial and graph problems.

Project results also include DP-WHERE, an application of differential privacy to human mobility modeling. Models of human mobility have broad applicability in urban planning, ecology, epidemiology, and other fields. Starting from Call Detail Records (CDRs) from a cellular telephone network that have gone through a straightforward anonymization procedure, the prior WHERE modeling approach produces synthetic CDRs for a synthetic population. The accuracy of WHERE has been validated against billions of location samples for hundreds of thousands of cell phones in the New York and Los Angeles metropolitan areas. DP-WHERE modifies WHERE to provide provable differential privacy guarantees. Experimental analysis of a large New York City data set examined the accuracy/privacy tradeoff of DP-WHERE relative to both WHERE and real CDRs. This work demonstrates the feasibility of creating, and potentially releasing, synthetic models that capture the mobility patterns of real metropolitan populations while preserving privacy.
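For background, the following is a minimal sketch of the exponential mechanism of McSherry and Talwar, the textbook baseline for differentially private selection from a discrete candidate set. It is illustration only, not the project's mechanism; notably, it depends on exactly the global-sensitivity parameter that the project's approach avoids. All names here (`exponential_mechanism`, `quality`) are hypothetical.

```python
import math
import random

def exponential_mechanism(candidates, quality, data, epsilon, sensitivity):
    """Select a candidate with probability proportional to
    exp(epsilon * quality / (2 * sensitivity)).

    candidates  -- finite list of feasible solutions (e.g., vertex covers)
    quality     -- function (data, candidate) -> real-valued score
    sensitivity -- max change in quality when one 0-1 input entry flips
    """
    scores = [quality(data, c) for c in candidates]
    # Subtract the max score for numerical stability before exponentiating.
    m = max(scores)
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    # Sample one candidate in proportion to its weight.
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]
```

Enumerating all candidates (e.g., all vertex covers) is exponential in general, which is precisely the efficiency gap that the project's use of approximation algorithms addresses.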
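The general recipe behind DP-WHERE is to perturb the empirical distributions that WHERE estimates from CDRs before sampling synthetic records. The sketch below shows a simplified version of that idea on a single hypothetical distribution: Laplace noise is added to a per-cell count histogram, the result is clipped and renormalized, and synthetic locations are sampled from it. The grid, counts, and sensitivity assumption (each user contributes one record) are illustrative, not taken from the DP-WHERE paper.

```python
import numpy as np

def dp_location_distribution(counts, epsilon, rng=None):
    """Laplace-perturb a histogram of per-cell location counts and
    renormalize it into a sampling distribution.

    Assumes each user contributes one record, so adding or removing a
    user changes one cell count by 1 and Laplace(1/epsilon) suffices.
    """
    rng = rng or np.random.default_rng()
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0.0, None)   # probabilities cannot be negative
    return noisy / noisy.sum()

# Hypothetical 4-cell grid over a metro area; counts are home locations.
counts = np.array([120.0, 45.0, 300.0, 80.0])
dist = dp_location_distribution(counts, epsilon=0.5)
synthetic_homes = np.random.default_rng().choice(len(dist), size=10, p=dist)
```

Because the noisy distribution, not the raw data, drives all downstream sampling, any number of synthetic CDRs can be generated without further privacy cost.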
Project research also developed SemRel, a novel probabilistic graphical model for discovering semantic relations in text, together with a novel approach to preserving entities' privacy. The model improves the interpretability of topic models by using a richer relational feature space, without jeopardizing the privacy of the entities and relations involved. In contrast to the previously used Gibbs sampling, we derived algorithms for stochastic variational inference, which can be run online and thus scale to larger datasets (a generic skeleton of such an inference loop is sketched below). We also laid the theoretical foundation for inferring SemRel models in a manner that preserves differential privacy with respect to the entities in the relations in the input corpus.

Project work on pan-privacy for streaming data considered fully dynamic data, in which items are tracked as they are inserted and deleted. Pan-privacy provides differential privacy while computing desired statistics on the data, even if the internal memory of the algorithm is compromised. Project research explored pan-private algorithms for basic analyses such as estimating distinct counts, moments, and heavy-hitter counts, and produced the first known pan-private algorithms for these problems in the fully dynamic model. Our algorithms rely on sketching techniques popular in streaming: in some cases we add suitable noise to a previously known sketch, using a novel approach that calibrates the noise to the underlying problem structure and the projection matrix of the sketch; in other cases we maintain certain statistics on sketches; in yet others we define novel sketches. We also present the first known lower bounds stated explicitly for pan-privacy, showing our results to be nearly optimal for these problems. Our lower bounds are stronger than those implied by differential privacy or dynamic data streaming alone, and they hold even if unbounded memory and/or unbounded processing time are allowed. The lower bounds use a noisy decoding argument and exploit a connection between pan-private algorithms and data sanitization.
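The report does not give SemRel's update equations, so the following is only a generic stochastic variational inference skeleton in the style of Hoffman et al.: sample a minibatch, optimize the local variational parameters, form an intermediate estimate of the global parameters as if the corpus consisted of copies of the minibatch, and take a Robbins-Monro step. The callbacks `local_step` and `intermediate_global` stand in for SemRel's model-specific computations and are hypothetical.

```python
import numpy as np

def svi(corpus, init_global, local_step, intermediate_global,
        n_iters, tau=1.0, kappa=0.7, batch_size=64, rng=None):
    """Generic stochastic variational inference loop.

    local_step(doc, lam)          -> optimized local variational params
    intermediate_global(doc, phi) -> global-parameter estimate as if the
                                     corpus were replicas of this document
    """
    rng = rng or np.random.default_rng()
    lam = init_global
    for t in range(n_iters):
        rho = (t + tau) ** (-kappa)   # decaying Robbins-Monro step size
        batch = rng.choice(len(corpus), size=batch_size, replace=False)
        # Noisy natural-gradient estimate averaged over the minibatch.
        lam_hat = np.mean(
            [intermediate_global(corpus[i], local_step(corpus[i], lam))
             for i in batch], axis=0)
        lam = (1.0 - rho) * lam + rho * lam_hat
    return lam
```

The decaying step size rho_t = (t + tau)^(-kappa), with kappa in (0.5, 1], is what makes the noisy steps converge, and it is also what lets the inference run online over a stream of documents rather than requiring repeated full-corpus passes as Gibbs sampling does.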
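To make the "noise calibrated to the projection matrix of the sketch" idea concrete, here is a toy pan-private estimator of the second frequency moment (F2) over a fully dynamic stream, assuming a standard AMS-style linear sketch. Laplace noise is folded into the sketch state at initialization, scaled to the largest column norm of the projection matrix, so the internal memory is differentially private at every instant. This is a simplified illustration of the general recipe, not the project's actual algorithm or noise calibration.

```python
import numpy as np

class PanPrivateF2:
    """Toy pan-private second-moment (F2) estimator for a dynamic stream.

    State is an AMS-style linear sketch y = A x with Laplace noise folded
    in at initialization, so the internal memory itself satisfies
    differential privacy at all times (the pan-privacy requirement).
    """

    def __init__(self, universe_size, width, epsilon, rng=None):
        rng = rng or np.random.default_rng()
        # Random +/-1 projection (4-wise independent signs in the real
        # AMS sketch; fully random here for simplicity).
        self.A = rng.choice([-1.0, 1.0], size=(width, universe_size))
        # One insertion/deletion of item i shifts y by column A[:, i],
        # so noise is calibrated to the largest L1 column norm (= width).
        self.b = width / epsilon
        self.y = rng.laplace(0.0, self.b, size=width)
        self.width = width

    def update(self, item, delta):
        """Process one stream element: delta = +1 (insert) or -1 (delete)."""
        self.y += delta * self.A[:, item]

    def estimate(self):
        # For random signs, E[(A x)_j^2] = F2; subtract the Laplace noise
        # variance 2*b^2 to debias the averaged squared sketch entries.
        return float(np.mean(self.y ** 2) - 2.0 * self.b ** 2)
```

Because the sketch is linear, insertions and deletions are handled symmetrically, which is exactly what the fully dynamic model requires.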

Budget Start: 2010-09-01
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $499,274
Name: Rutgers University
City: Piscataway
State: NJ
Country: United States
Zip Code: 08854