Privacy concerns can prevent the construction of a centralized data warehouse to support data mining. For example, the Centers for Disease Control (CDC) may want to mine insurance companies' data to identify trends and patterns in disease outbreaks, such as understanding and predicting the progression of a flu epidemic. Gathering all patient data into a single warehouse increases the opportunities for privacy breaches and misuse. We propose an alternative: secure collaborative computation among the parties holding the data, which produces the desired data mining results while provably preventing disclosure of private data.
This project will enable knowledge discovery under the following assumptions: (1) data are distributed across multiple sources, with security or privacy concerns that limit data sharing, and (2) if the data were gathered into a centralized warehouse, data mining tools could identify patterns or relationships that yield useful knowledge. The techniques developed will replicate or approximate the results of centralized data mining, with quantifiable limits on the disclosure of data from each site. The goal is to develop a toolkit of privacy-preserving distributed computation techniques that can be assembled to solve specific real-world problems. Once assembling these components is a matter of development rather than research, widespread use of privacy-preserving distributed data mining will become feasible.
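As an illustration of the kind of component such a toolkit might contain, the following is a minimal sketch of a secure sum built on additive secret sharing. The abstract does not name this technique; it is swapped in here only as a simple example. The names `make_shares`, `secure_sum`, `MODULUS`, and the sample site counts are all hypothetical. The sketch assumes that every party follows the protocol honestly, that all inputs are non-negative integers, and that the true total is smaller than the modulus. It simulates all parties in one process, whereas a real deployment would send each share over a secure channel.

```python
import secrets

# Hypothetical public modulus; assumed larger than any possible true total.
MODULUS = 2**61 - 1


def make_shares(value, n_parties, modulus=MODULUS):
    """Split `value` into n additive shares that sum to `value` mod `modulus`.

    Any n-1 of the shares are uniformly random and reveal nothing about
    the value on their own.
    """
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def secure_sum(private_values, modulus=MODULUS):
    """Simulate a secure sum over the parties' private values."""
    n = len(private_values)
    # Party i splits its value into n shares; share j is sent to party j.
    all_shares = [make_shares(v, n, modulus) for v in private_values]
    # Party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # The published partial sums add up to the true total.
    return sum(partial_sums) % modulus


# Hypothetical per-site case counts that no site wants to reveal individually.
site_counts = [120, 45, 310]
total = secure_sum(site_counts)
```

In this sketch each site learns the overall total, which matches what centralized counting would produce, but sees only random-looking shares of the other sites' inputs.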