The goal of this research project is to enable statistical analysis and knowledge discovery on networks without violating the privacy of participating entities. Network data sets record the structure of computer, communication, social, or organizational networks, but they often contain highly sensitive information about individuals. The availability of network data is crucial for analyzing, modeling, and predicting the behavior of networks.
The team's approach is model-based generation of synthetic data: a model of the network is released under strong privacy conditions, and analysts study samples from that model directly. Output perturbation techniques are used to privately compute the parameters of popular network models. The resulting "noisy" model parameters are released; they satisfy a strong, quantifiable privacy guarantee while still preserving key properties of the networks. Analysts can use the released models to sample individual networks or to reason about properties of the implied ensemble of networks.
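The release-then-sample pipeline described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes the Laplace mechanism as the output perturbation step, edge-level differential privacy, a single statistic (the edge count), a publicly known node count, and a simple uniform random graph as the released model. All function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_edge_count(edges, epsilon, rng):
    """Output perturbation: true edge count plus Laplace noise.

    Assumption: edge-level privacy, so adding or removing one edge
    changes the count by at most 1 (sensitivity 1).
    """
    return len(edges) + laplace_noise(1.0 / epsilon, rng)


def sample_random_graph(n, noisy_count, rng):
    """Sample a synthetic graph on n nodes from the released noisy count.

    Assumption: n is public. Only the noisy count is used here, so
    sampling is post-processing and consumes no extra privacy budget.
    """
    max_edges = n * (n - 1) // 2
    p = min(1.0, max(0.0, noisy_count / max_edges))  # clamp density to [0, 1]
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p}


# Illustrative only: Python's random module is not a secure noise source.
rng = random.Random(0)
private_edges = {(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)}
released = noisy_edge_count(private_edges, epsilon=1.0, rng=rng)
synthetic = sample_random_graph(4, released, rng)
```

An analyst would see only `released` and graphs such as `synthetic`, never `private_edges`. Richer models need more statistics, each with its own sensitivity analysis.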
By synthesizing versions of networks that would otherwise remain hidden, this research can advance the study of topics such as disease transmission, network resiliency, and fraud detection. The project will result in publicly available privacy tools, a repository for derived models and sample networks, and contributions to workforce development in the field of information assurance. The experimental research is linked to educational efforts including undergraduate involvement in research through a Research Experience for Undergraduates site, as well as interdisciplinary seminars.
For further information, see the project web site: http://dbgroup.cs.umass.edu/private-network-data
This project has developed a set of algorithms for protecting personal privacy while supporting the release of networked data sets. Data privacy research has most commonly focused on tabular data, in which an individual is described by a set of attributes contained in a single record. Networked data poses a special challenge because it describes a graph in which an edge relation represents connections, interactions, or communication between named nodes. Protecting privacy is more complicated for this type of data: revealing the properties of connected individuals may constitute dangerous disclosures, and revealing information about one individual is more likely to lead to inferences about other connected individuals.
This project has developed conceptual and technological advancements for modeling networked data sets under the rigorous model of differential privacy. Our approach is model-based generation of synthetic data: a model of the networked data set is released under strong privacy conditions, and analysts study samples from that model directly. The data received by analysts must be perturbed or distorted to preserve privacy; however, analysts receive measures of estimated error along with the synthesized data.
The main contributions are the following. First, we developed algorithms for privately estimating a number of key statistics used with a popular model of network formation (the exponential random graph model). For these statistics, our method allows an analyst to fit this model to the data with improved accuracy. Second, we developed a method for constructing synthetic multi-relational data sets (which generalize networked data beyond a single relationship), also with a rigorous privacy guarantee and improved accuracy. Third, we investigated foundational issues in the statistical modeling of networked data, developing new modeling approaches that increase correctness and descriptive power.
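As one illustration of how an error measure can accompany a perturbed statistic, the sketch below assumes the statistic was released with Laplace noise of scale sensitivity/epsilon. The report does not specify the project's own error measures, so this choice and the function names are assumptions. Both quantities depend only on the public parameters, so publishing them reveals nothing further about the data.

```python
import math


def laplace_error_bound(sensitivity, epsilon, confidence=0.95):
    """Half-width t such that |noise| <= t with the given probability.

    For Laplace noise with scale b, P(|noise| > t) = exp(-t / b),
    so t = b * ln(1 / (1 - confidence)).
    """
    scale = sensitivity / epsilon
    return scale * math.log(1.0 / (1.0 - confidence))


def laplace_std(sensitivity, epsilon):
    """Standard deviation of Laplace noise with scale sensitivity/epsilon."""
    return math.sqrt(2.0) * sensitivity / epsilon


# Hypothetical release: a statistic with sensitivity 1 at epsilon = 0.5.
bound = laplace_error_bound(sensitivity=1.0, epsilon=0.5)
spread = laplace_std(sensitivity=1.0, epsilon=0.5)
```

An analyst could then read a released value `x` as lying within `x ± bound` of the true statistic with 95% probability, under the stated noise assumption.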
The project enhanced cyber-security curricula at the undergraduate and graduate levels and added to the cyber-security workforce. Its results were disseminated both nationally and internationally.