The goal of this project is to make possible database searching in a privacy-constrained manner: A private database provider allows only properly authorized searches by clients, in a manner that does not reveal the search criteria yet enforces the requirement that the client learns only what is authorized by the search. The initial focus will be on techniques for the case of exact matches, later extended to the much more difficult case of approximate matching. If multiple matches are found, either all of them are returned, or a subset of the "best" of them, under appropriately defined notions of quality, is returned; in approximate matching there is a natural notion of quality, namely, having smaller distance to the target of the search as specified by the query. The technical challenges include verification of the validity of a search request, and then carrying out the search, in manner that enforces the search's authorized criteria without revealing them. The project holds the promise of leading to substantial improvements in the highly unsatisfactory current "state of the practice" for searches carried out on private and sensitive databases, that unnecessarily reveal too much information and prevent useful collaborations from taking place due to concerns over the leakage of sensitive information. The minimal-disclosure feature of the protocols will also make possible a de facto "defense in depth", in that a compromised server will no longer automatically imply the compromise of all the clients' interactions with that server.
One of the major impediments to larger-scale use of cloud services is concern over confidentiality of the data, and of the queries and computations carried out on it by the cloud servers. In this project we developed techniques, insights, and proof of concept implementations for achieving confidentiality when using untrusted commercial cloud services, whereby the cloud servers store and manipulate the data on behalf of its owners without learning anything about either the data or the queries and computations done on it. Within such a secure and private outsourcing framework, we obtained results for comparisons of DNA sequences, querying of DNA databases, matrix computations, image processing computations, and many other variants of information storage and retrieval (including range queries). The protocols we designed place the bulk of the computational burden on the cloud servers, so that the client can carry out its share of the computations even when it is computationally weak (e.g., slow, or battery-limited, or having limited local storage — as in portable devices like cell phones and tablets). In all cases, the cloud servers do not learn the data — they store it and carry out computations on it "without seeing it". Only the client learns the answers to the queries and computations. For performance reasons, we avoided or minimized the use of expensive cryptographic primitives like exponentiation, and as much as possible limited their use to the cloud servers rather than at the client side. In all of our protocols, cheating by an untrusted server is detected by the client with high probability. In some protocols the server needs to produce, in addition to the answer, a proof of its integrity and completeness (e.g., for database queries). We highlight below two of our results, selected both because of their technical depth and because they involve DNA data, for which privacy and confidentiality are particularly important. Our work on secure and private outsourcing of DNA sequence comparisons significantly improves the state of the art: (i) The client does linear work and communication, whereas the previous best was quadratic; (ii) the protocol works in 1 round whereas the previous took a quadratic number of rounds; (iii) the server space usage is linear whereas the previous best was quadratic; and (iv) the protocol uses only lightweight cryptography whereas the previous used expensive homomorphic encryption and required oblivious transfer. Our result on the querying of encrypted DNA databases improves the state of the art in many significant ways. Our scheme is deterministic, with zero probability of a wrong answer (as opposed to a low probability for the previous best). Moreover, our encoding of the data makes it possible for us to handle a richer set of queries than exact matching between the query and each sequence of the database, including: (i) counting the number of matches between the query symbols and a sequence; (ii) logical OR matches where a query symbol is allowed to match a subset of the alphabet thereby making it possible to handle (as a special case) a "not equal to" requirement for a query symbol (e.g., "not a G"); (iii) queries that specify the number of occurrences of each kind of symbol in the specified sequence positions (e.g., two ‘A’ and four ‘C’ and one ‘G’ and three ‘T’, occurring in any order in the query-specified sequence positions); (iv) a threshold query whose answer is ‘yes’ if the number of matches exceeds a query-specified threshold (e.g., "7 or more matches out of the 15 query-specified positions"). For all query types we can hide the answers from the decrypting server, so that only the client learns the answer. In all cases, the client deterministically learns only the query's answer, except for query type (iv) where we quantify the (very small) statistical leakage to the client of the actual count.