The digital age is predicated on information being ubiquitous. The ability to access relevant data stored on remote servers "in the cloud" has become an indispensable resource in everyday lives. Numerous online services let users query public datasets for data items such as map directions, stock quotes, and flight prices, to name a few. Digital content providers also rely on user queries to identify the content desired by the user. Unfortunately, such queries have the potential to reveal highly-sensitive information *about the users*, thereby compromising their privacy. For example, institutional investors querying a stock-market database for the value of certain stocks may prefer not to reveal their interest in these stocks since it could influence their price. As another example, most people are deeply uncomfortable with exposing their media consumption diet to a centralized server that can be targeted by hacking or subpoena. It can be convincingly argued that access to such media consumption profiles can reveal the person's sexual orientation, political leanings, and cultural affiliations.

A great deal of research has been devoted to methods that guarantee the security and integrity of *the data*. Much less work, however, has been devoted to protecting the privacy of *the user*. Efficient and reliable retrieval of information from distributed databases, with information-theoretic guarantees of user privacy, is the focus of this project. The relevant area of research is known as private information retrieval. While most of the existing results in this area are theoretical, a major goal in this project is to bridge the gap between the theory of private information retrieval and the practice of distributed storage. As such, potential outcomes of this investigation may extend beyond the scope of academic research, contributing to new technologies and products.

Private information retrieval (PIR), conceived in the seminal papers of Chor, Goldreich, Kushilevitz, and Sudan over 20 years ago, has been traditionally studied in theoretical computer science and cryptography, with emphasis on the complexity of the communication between the user and the servers that store the database. While major breakthroughs have been achieved in this area over the years, the prevailing paradigm has always been that of replicating the database on several non-communicating servers. Such replication leads to a significant storage overhead, which is undesirable. Moreover, motivated by advances in coding for distributed storage, it was recently recognized that if database replication is replaced by *database coding*, the full power of coding-theoretic methods can be brought to bear on the problem. While extremely promising, this line of research is still in its infancy. The goal of this project is to follow-up on the database coding idea, and follow it through to its ultimate potential. In pursuit of this goal, the following questions are addressed:

(1) What is the information-theoretic capacity of PIR? That is, what is the maximum amount of information that can be privately retrieved per downloaded bit, under various scenarios?

(2) What is the optimal storage overhead of PIR? Can we achieve both privacy and efficient communication (on the download and upload) without replicating the stored data even once?

(3) How can both privacy and download efficiency be maintained in the presence of impediments such as malicious or colluding servers, unsynchronized data, and/or communication errors?

(4) Codes for distributed storage systems tolerate and repair node failures while making the data available to several users at once. How can we combine PIR protocols with such coding?

(5) What is the best possible tradeoff between the various desirable PIR features, such as download efficiency, storage overhead, and resilience to errors/collusions/node-failures?

This proposal is a natural outgrowth of the research recently carried out by the PIs and others in this area. Prior related work will provide a springboard for rapid progress toward the ambitious research objectives of this project. Knowledge, techniques, and qualitative insights gained through this investigation are expected to contribute to the foundations of the field, and to help bridge the gap between the theory of PIR and the practice of distributed storage.

Project Start
Project End
Budget Start
2017-09-01
Budget End
2020-08-31
Support Year
Fiscal Year
2017
Total Cost
$450,000
Indirect Cost
Name
University of California San Diego
Department
Type
DUNS #
City
La Jolla
State
CA
Country
United States
Zip Code
92093