Privacy is a fundamental individual right. In the era of big data, large amounts of data about individuals are collected both voluntarily (e.g., frequent flier or shopper incentive programs) and involuntarily (e.g., the US Census or medical records). With the ready ability to search for information and correlate it across distinct sources (using data analytics and/or recommender systems), privacy violation takes on an ominous dimension in this information age.
Anonymization of user information is a classical technique, but it is susceptible to correlation attacks: by correlating the anonymized database with another (perhaps publicly available) deanonymized database, a user's identity can still be divulged. A way out of the limitations of anonymization is to release a randomized database; this offers plausible deniability to any user whose identity would otherwise be exposed by the data release. Differential privacy provides a systematic framework for guaranteeing such deniability of a user's presence or absence, with strong privacy guarantees that hold even against adversaries with arbitrary side information. It is of fundamental interest to characterize privacy mechanisms that randomize "just enough" to keep the released database as true to the intended one as possible, thereby providing maximal utility.
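For concreteness, the standard textbook formulation of epsilon-differential privacy (due to Dwork et al.) can be stated as follows; this is the canonical definition, not a project-specific variant:

```latex
% Standard \varepsilon-differential privacy (Dwork et al.):
% a randomized mechanism M is \varepsilon-differentially private if,
% for all pairs of neighboring databases D, D' differing in a single
% user's record, and for all measurable output sets S,
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S].
\]
```

Smaller values of epsilon correspond to stronger privacy: the output distribution of the mechanism changes by at most a factor of e^epsilon when any single user's record is added or removed.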
Recent work has connected the areas of information theory and statistical data privacy (via a hypothesis testing formulation) and demonstrated novel privacy mechanisms that improve exponentially (in terms of, say, the variance of the noise added) upon the state of the art in the medium and low privacy regimes. Building on this work, the objective of the project is threefold: (a) characterize the fundamental limits to tradeoffs between privacy and utility in a variety of canonical settings; (b) discover (near-)optimal mechanisms that can be efficiently implemented in practice; and (c) seek natural notions of statistical data privacy (beyond differential privacy) using the operational context of hypothesis testing.
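To fix ideas, the following is a minimal sketch of the canonical Laplace mechanism, the classical baseline against which improvements in noise variance are measured; the function name and parameters are illustrative, not part of the project's deliverables:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value perturbed by Laplace noise calibrated to sensitivity/epsilon.

    This is the canonical epsilon-differentially-private baseline; mechanisms of
    the kind sought in this project aim to add noise of (exponentially) smaller
    variance in the medium and low privacy regimes.
    """
    scale = sensitivity / epsilon  # Laplace scale b; noise variance is 2 * b**2
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Illustrative usage: privately release a count query (sensitivity 1) at epsilon = 0.5
noisy_count = laplace_mechanism(true_value=128.0, sensitivity=1.0, epsilon=0.5)
print(noisy_count)
```

Note how the noise variance 2 * (sensitivity/epsilon)**2 grows as epsilon shrinks; this is precisely the privacy-utility tradeoff the project seeks to characterize and improve upon.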
Privacy is a central, and multifaceted, social and technological issue of today's information age. This project focuses on the technical aspect of this area and seeks to discover fundamental limits to privacy-utility tradeoffs in the context of the currently well-established notion of differential privacy. The expected results are fundamental and immediately applicable to a variety of practical settings. Specifically, two concrete practical settings, involving genomic data release and smart meter data release, will be studied in detail. Due to privacy concerns, genomic and smart meter data are largely unavailable, precluding widespread data analytics and the practical benefits such analysis would bring. This project will build and release a software suite of sanitization tools based on the privacy mechanisms discovered as part of this project.