This research project will develop a pilot of an integrated system for disseminating large-scale data about people. The project will address critical challenges that have inhibited the widespread dissemination of large-scale databases that can advance basic social, behavioral, and economic science research and that offer enormous potential benefits to society. Among these challenges is the unintended disclosure of data subjects' identities and sensitive attributes, which violates promises, and sometimes laws, designed to protect data subjects' privacy and confidentiality. The products of this project will facilitate the development and dissemination of safe and useful large-scale datasets. The project will result in extensible, open-source products that constitute a proof of concept and that will provide valuable information for future larger-scale implementations of the system. The project will therefore lay the groundwork for a potential transformation in data dissemination, providing data stewards with the infrastructure they need to release data products that advance social science, policy making, and training. The project will also provide education and training opportunities for a post-doctoral researcher as well as graduate and undergraduate students.

The investigators will create new methodology and broadly applicable tools for meeting data dissemination challenges. From a technical perspective, they will advance methodology for generating synthetic datasets via nonparametric methods capable of handling high-dimensional data. They will advance methodology for providing feedback on the quality of inferences from heavily redacted data, and they will develop methods for in-depth assessment and characterization of the disclosure risks inherent in releasing large-scale synthetic data with and without verification servers. From an infrastructure perspective, the investigators will develop systems and architecture for integrating the three core tools (synthetic data, verification servers, and remote access) in ways that result in secure, scalable access to data. The pilot system will be built with the goal of disseminating a version of a dataset on the work histories of federal government employees.

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 1443014
Program Officer: Amy Walton
Project Start:
Project End:
Budget Start: 2015-01-01
Budget End: 2018-12-31
Support Year:
Fiscal Year: 2014
Total Cost: $1,498,683
Indirect Cost:
Name: Duke University
Department:
Type:
DUNS #:
City: Durham
State: NC
Country: United States
Zip Code: 27705