This research project will develop statistical methodology for the privacy-preserving sharing of large, complex, and highly structured data and models. These types of data are commonly encountered in finance, longitudinal studies, wearable device studies, medical imaging, and electronic health records. Complex data present substantial challenges for preserving subjects' privacy while making publicly available the data that will advance scientific understanding and policy making. The project will make major theoretical and methodological contributions to statistical data privacy and to the fields it relies on, such as statistics and computer science. Formal privacy tools are now being adopted by major companies and government agencies for sharing data summaries. This research will demonstrate how even large, complex structures, such as human faces, can be made private when their structure is properly exploited. The methods to be developed will have applications in the social, behavioral, and economic sciences, medical research, and industry. The investigators will mentor a post-doctoral researcher, as well as graduate and undergraduate students. Open-source software packages will be developed and made publicly available.
This interdisciplinary research project will improve upon methods in statistical disclosure limitation, differential privacy, and functional data analysis to develop formal privacy tools. These tools are essential in the era of big data. The project will focus on three aims: (1) Development of privacy tools for objects in infinite-dimensional linear spaces, especially functions and surfaces. These tools will include non-Gaussian perturbations and exponential mechanisms, with a special focus on functional principal components and regression, given their prominence in functional data analysis; (2) Development of privacy mechanisms for modeling and sharing of objects in nonlinear spaces that can be described as Riemannian manifolds. Such data arise naturally when working with 3D images, shapes, covariance matrices, or large-scale spatio-temporal data. The manifold structure will be used to develop perturbation methods, especially Gaussian ones, that produce representative sanitized estimates and data with greater statistical utility; (3) Development of synthetic data mechanisms for samples from infinite-dimensional linear spaces or nonlinear manifolds. Synthetic data are becoming increasingly critical for expediting scientific progress while maintaining data privacy. However, producing synthetic data that properly mimic the complex structures described here remains a major open problem. This project represents some of the first work that exploits nonlinear spaces to increase the utility of the resulting sanitized estimates and data.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.