Statistical agencies and other organizations that disseminate data to the public are ethically and often legally required to protect the confidentiality of respondents' identities and sensitive attributes. To satisfy these requirements, agencies can release multiply-imputed, partially synthetic data. Such data sets comprise the originally surveyed units, with some values (for example, sensitive values at high risk of disclosure or values of key identifiers) replaced by multiple imputations. This research improves the risk-utility profile of partially synthetic data approaches by addressing four key issues in their implementation. First, the research develops methods for quantifying identification disclosure risks for partially synthetic data sets. These measures account for (i) the information existing in all the synthetic data sets, (ii) various assumptions about intruder knowledge and behavior, and (iii) the details released about the synthetic data generation model. This information is crucial to data producers seeking to evaluate the protection afforded by synthetic data. Second, the research provides strategies that data producers can use to select values to synthesize. The strategies optimize the trade-offs between risk and utility for candidate sets of values. Third, the research yields strategies for selecting synthetic data sets. For example, the data producer can discard synthetic data sets whose disclosure risk is too high or whose data utility is too low. The research produces guidelines for how such selection affects inferences made using existing methods, and it develops appropriate methods of inference for situations where the effects of selection are substantial. Finally, the research develops flexible, nonparametric modeling strategies for synthetic data generation based on techniques from machine learning. This improves the analytic validity of partially synthetic data approaches.
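To make the idea of partially synthetic data concrete, the following is a minimal sketch, not the project's actual method. It assumes a toy data set with one retained variable (`age`) and one sensitive variable (`income`), flags the top 10% of incomes as high risk, and replaces only those values with draws from a simple normal linear regression, repeated to give `m` released data sets. The variable names, the 10% cutoff, the parametric regression model, and the function names are all illustrative assumptions; the research described above targets more flexible, nonparametric models.

```python
import random
import statistics


def fit_centered_ols(x, y):
    """Least-squares fit of y = a + b * (x - mean(x)), plus residual sum of squares."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    rss = sum((yi - y_bar - b * (xi - x_bar)) ** 2 for xi, yi in zip(x, y))
    return {"n": len(x), "x_bar": x_bar, "a": y_bar, "b": b, "sxx": sxx, "rss": rss}


def synthesize(x, y, replace, m=5, seed=0):
    """Return m copies of y in which only the flagged values are replaced by model draws."""
    rng = random.Random(seed)
    fit = fit_centered_ols(x, y)
    df = fit["n"] - 2
    datasets = []
    for _ in range(m):
        # Draw the regression parameters from their approximate posterior
        # (assumed noninformative prior), so the m data sets reflect
        # parameter uncertainty as well as residual noise.
        sigma2 = fit["rss"] / rng.gammavariate(df / 2, 2)  # rss / chi-square(df)
        a = rng.gauss(fit["a"], (sigma2 / fit["n"]) ** 0.5)
        b = rng.gauss(fit["b"], (sigma2 / fit["sxx"]) ** 0.5)
        sigma = sigma2 ** 0.5
        # Unflagged records keep their observed value; flagged ones are imputed.
        synthetic = [
            rng.gauss(a + b * (xi - fit["x_bar"]), sigma) if flag else yi
            for xi, yi, flag in zip(x, y, replace)
        ]
        datasets.append(synthetic)
    return datasets


# Toy data (entirely made up for illustration).
toy_rng = random.Random(1)
age = [toy_rng.uniform(20, 70) for _ in range(200)]
income = [20 + 0.8 * a + toy_rng.gauss(0, 5) for a in age]

# Hypothetical risk rule: treat the top 10% of incomes as high risk.
cutoff = sorted(income)[int(0.9 * len(income))]
flags = [value >= cutoff for value in income]

released = synthesize(age, income, flags, m=5)
```

Each element of `released` is one partially synthetic version of `income`: records that were not flagged are released unchanged, which is what distinguishes partial from full synthesis. Centering `age` makes the intercept and slope estimates independent, which keeps the parameter draws simple enough for the standard library.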

This research provides federal agencies, survey organizations, research centers, and other data producers with more and better options for public use data dissemination than exist at present. As resources available to malicious data users continue to expand, the alterations needed to protect public use data with traditional disclosure limitation techniques (such as swapping data values, adding random noise, or aggregating data) may become so extreme that, for many analyses, the released data are no longer useful. Synthetic data, on the other hand, have the potential to enable public use data dissemination while preserving data utility. Ultimately, with higher quality public use data, secondary data analysts can make more and better inferences, leading to deeper understanding of social science and policy questions.

Agency: National Science Foundation (NSF)
Institute: Division of Social and Economic Sciences (SES)
Application #: 0751671
Program Officer: Cheryl L. Eavey
Budget Start: 2008-06-01
Budget End: 2011-05-31
Fiscal Year: 2007
Total Cost: $180,000
Name: Duke University
City: Durham
State: NC
Country: United States
Zip Code: 27705