Government, commercial, and non-profit organizations maintain databases that hold large amounts of sensitive or confidential information, such as income and medical records. Protecting the privacy and confidentiality of such databases is therefore a primary concern. Database owners often adopt a data perturbation approach when they export or publish sensitive or confidential data. In practice, however, it is very hard to quantify and evaluate the tradeoff between data utility and disclosure risk, because the data space that must be searched for disclosure analysis is almost infinite. This project develops a novel model-based disclosure analysis approach that first builds statistical models and then analyzes potential disclosure at the level of the models' parameters. Because the search space of parameters is much smaller than that of the data, and all information that attackers can derive is contained in those parameters, this approach is more effective and efficient. Because previous research conducted only empirical evaluations, this project also conducts a theoretical study of the perturbation-based approach by deriving the explicit relationship between reconstruction accuracy and the amount of noise added for various reconstruction methods. The project will deliver a prototype system that can conduct full disclosure analysis using both model-based and randomization-based approaches to satisfy users' complex privacy and confidentiality specifications.
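As a rough illustration of how reconstruction accuracy depends on the noise added, the sketch below perturbs correlated numerical data with i.i.d. Gaussian noise and then applies one well-known reconstruction idea, PCA-based filtering. It is not necessarily one of the reconstruction methods studied in this project. The function names, the synthetic low-rank data, and the assumption that the attacker knows the rank are all hypothetical choices made for the example.

```python
import numpy as np

def perturb(X, sigma, rng):
    """Additive noise perturbation: release Y = X + E with E ~ N(0, sigma^2) i.i.d."""
    return X + rng.normal(0.0, sigma, size=X.shape)

def pca_reconstruct(Y, rank):
    """Estimate the original data by projecting the perturbed data onto its
    top `rank` principal directions (rank is assumed known to the attacker)."""
    mean = Y.mean(axis=0)
    Yc = Y - mean
    cov_y = np.cov(Yc, rowvar=False)
    # i.i.d. noise adds sigma^2 to every eigenvalue of the covariance but
    # leaves the eigenvectors unchanged in expectation, so the leading
    # directions of cov_y approximate those of the original data.
    eigvals, eigvecs = np.linalg.eigh(cov_y)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:rank]]
    return Yc @ top @ top.T + mean

def rmse(A, B):
    """Root mean squared error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((A - B) ** 2)))

# Hypothetical data: 20 correlated attributes driven by 2 latent factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 20))

for sigma in (0.5, 1.0, 2.0):
    Y = perturb(X, sigma, rng)
    X_hat = pca_reconstruct(Y, rank=2)
    print(f"sigma={sigma}: error of released data={rmse(Y, X):.3f}, "
          f"error after filtering={rmse(X_hat, X):.3f}")
```

Under these assumptions the filtered estimate sits closer to the original data than the released data does, and the gap changes with the noise level. The example only illustrates this effect empirically and does not derive the relationship in closed form.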
The system is intended for use by local industry partners and other organizations. Educational impacts of this project will include the involvement of graduate and undergraduate students and the incorporation of research projects into courses related to database security and privacy. The project will train two Ph.D. students, strengthening the nation's capability in information security. All results, including publications, empirical studies, and software, will be disseminated via the project web site (www.cs.uncc.edu/~xwu/career).
The overall objective of this project is to advance the theoretical understanding of fundamental issues related to data privacy and to develop practical techniques for data disclosure analysis, as well as portable course materials that support teaching in undergraduate and graduate courses. We evaluated the potential privacy disclosures of widely used randomization-based privacy-preserving data mining methods on numerical data, categorical data, and linked network data. Our findings showed that randomization approaches, including additive noise perturbation and projection-based randomization, are vulnerable to various attacks. We discovered that four measures used in categorical data analysis have a monotonic property, and hence data mining tasks based on these measures can be executed directly on the randomized data without knowing the distortion parameters. We examined attribute disclosure under linking attacks and developed efficient solutions for determining optimal distortion parameters. Our findings showed that the randomization approach can better recover the distribution of the original data from the disguised data. As a result, randomization incurs smaller utility loss (under the same privacy requirements) than generalization and permutation approaches. We also developed feature-preserving randomization techniques and a low-rank-approximation-based reconstruction method for linked data that better preserve data utility while satisfying privacy requirements. To overcome the limitations of randomization approaches, we further developed an effective and efficient modeling-based privacy-preserving data mining approach that first builds an approximate statistical model to describe general databases and then conducts both identity disclosure and value disclosure analysis in terms of the parameters of that model. By enforcing privacy protection at the parameter level, we can release data generated by the model without incurring further privacy disclosures.
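The recovery of an original distribution from disguised categorical data, mentioned above, can be sketched with a simple randomized-response scheme. This is a minimal illustration under stated assumptions and not the project's own implementation: the uniform distortion matrix, the retention probability `p`, and all function names are hypothetical choices for the example.

```python
import numpy as np

def distortion_matrix(k, p):
    """k x k matrix P with P[i, j] = Pr(reported = i | true = j).
    A value is kept with probability p; otherwise it is replaced by one of
    the other k - 1 categories chosen uniformly (an assumed scheme)."""
    P = np.full((k, k), (1.0 - p) / (k - 1))
    np.fill_diagonal(P, p)
    return P

def randomize(values, k, p, rng):
    """Disguise each categorical value (coded 0..k-1) independently."""
    values = np.asarray(values)
    keep = rng.random(values.size) < p
    # Adding a shift in 1..k-1 modulo k picks a different category uniformly.
    shift = rng.integers(1, k, size=values.size)
    return np.where(keep, values, (values + shift) % k)

def reconstruct(reported, P):
    """Estimate the original distribution by solving P @ pi = observed."""
    k = P.shape[0]
    observed = np.bincount(reported, minlength=k) / len(reported)
    return np.linalg.solve(P, observed)

# Hypothetical example: three categories with a skewed true distribution.
rng = np.random.default_rng(0)
true_dist = np.array([0.5, 0.3, 0.2])
original = rng.choice(3, size=200_000, p=true_dist)

P = distortion_matrix(k=3, p=0.7)
disguised = randomize(original, k=3, p=0.7, rng=rng)
estimate = reconstruct(disguised, P)
print("true distribution:     ", true_dist)
print("estimated distribution:", np.round(estimate, 3))
```

With a large sample the estimate is close to the true distribution even though about 30% of individual values were changed. The estimate can fall slightly outside [0, 1] for small samples, so a practical implementation would need to handle that case.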
We developed three portable course modules and incorporated them into graduate seminar courses. We developed and offered a graduate-level course on data privacy. We developed two tutorials (on randomization-based privacy-preserving data mining and on privacy preservation of graphs and social networks) and presented them at three international conferences. Seven Ph.D. students, including three female students, were involved in this project; they were exposed to the latest developments in the area and significantly improved their research skills as a result. Three of them successfully defended their dissertations and graduated from UNC Charlotte. We reached out to the broader community on data privacy education by showcasing research results to local high school students, college undergraduate students, industry visitors, and researchers. The research outcomes have an impact on the areas of privacy-preserving data mining, statistical databases, and survey research by improving the theoretical understanding of fundamental issues related to privacy and confidentiality preservation in general databases. Government agencies and companies may use our techniques to develop new services and products that better protect customers’ privacy during data collection, analysis, and mining.