Protection of individual privacy is a top concern when releasing and sharing data. This project seeks better and more practical ways to enhance the utility of released individual-level data without compromising individual privacy. Differential privacy provides a robust concept for privacy protection in mathematical terms without making assumptions about the background knowledge of data intruders. Despite a strong interest in practice to adopt differential privacy when releasing data, reluctance exists because of various reasons - e.g., potentially a high level of noise injected into the original data to achieve differential privacy, especially in high-dimensional data; and the lack of user-friendly software and tools to implement differentially private algorithms in practice. This project develops techniques and tools to create synthetic "surrogate datasets" with the same structure as the original data, satisfying differential privacy while offering sufficient information for valid and accurate population-level statistical analysis. The project evaluates the proposed work with simulated data, the census-record-based ADULT dataset frequently used in anonymization studies, and a medical dataset with clinical, biospecimen, and genetic attributes from Parkinson's patients, benchmarked against current practice for releasing different private data. The work is being featured in several community outreach programs to stimulate interests in STEM careers among K-12 students.
The project first establishes theoretical and methodological foundations, including but not limited to extending the multiplicative weighting mechanism to handle nonlinear queries and numerical data, establishing a theory that guarantees individual privacy protection in the released surrogate data, and focusing on achieving the statistical validity of inferences based on the surrogate data. The reduction of the necessary noise level to achieve differential privacy leverages the state-of-the-art dimensional reduction techniques and the inherent properties of the multiplicative weighting mechanism. The method is evaluated by simulation studies and applications to real-life datasets (including social/financial data and health care data) benchmarked against other methodologies for releasing individual-level data. Finally, open-source software is being developed for release on the Comprehensive R Archive Network and GitHub that produces surrogate datasets, along with examples and documents to explain the disclosure risk and the supported utility of data analysis.