This research project will develop a pilot of an integrated system for disseminating large-scale data about people. The project will address critical challenges that have inhibited the widespread dissemination of large-scale databases that can advance basic social, behavioral, and economic science research and that offer enormous potential benefits to society. Among these challenges is the unintended disclosure of data subjects' identities and sensitive attributes, which violates promises, and sometimes laws, designed to protect data subjects' privacy and confidentiality. The products of this project will facilitate the development and dissemination of safe and useful large-scale datasets. The project will result in extensible, open-source products that constitute a proof of concept and that will provide valuable information for future larger-scale implementations of the system. The project will therefore lay the groundwork for a potential transformation in data dissemination, providing data stewards with the infrastructure they need to release data products that advance social science, policy making, and training. The project will also provide education and training opportunities for a post-doctoral researcher as well as graduate and undergraduate students.
The investigators will create new methodology and broadly applicable tools for meeting data dissemination challenges. From a technical perspective, they will advance methodology for generating synthetic datasets via nonparametric methods capable of handling high-dimensional data. They will advance methodology for providing feedback on the quality of inferences from heavily redacted data, and they will develop methods for in-depth assessment and characterization of the disclosure risks inherent in releasing large-scale synthetic data with and without verification servers. From an infrastructure perspective, the investigators will develop systems and architecture for integrating the three core tools (synthetic data, verification servers, and remote access) in ways that result in secure, scalable access to data. The pilot system will be built with the goal of disseminating a version of a dataset on the work histories of federal government employees.