Synthetic Data User Testing and Dissemination

Vilhuber, Lars; Abowd, John

Abstract

Researchers throughout the social, behavioral, economic, and health sciences use data to test hypotheses about a wide range of individual and social behaviors, decisions, and outcomes. Government statistical agencies regularly collect data that are extremely valuable for this purpose. However, these data are not made directly available to the research community because the identity of the data providers (respondents) is part of the data itself. Therefore, statistical agencies and the scientific community have been developing methods to make analytically valid and highly detailed data available to researchers while simultaneously protecting individual privacy.

A particularly valuable and sensitive kind of data is linked administrative data, such as the Longitudinal Employer-Household Dynamics (LEHD) data, the Longitudinal Business Database (LBD), and surveys with linked administrative data, such as the Survey of Income and Program Participation (SIPP). These datasets have been constructed with support from statistical agencies and the NSF. The highly detailed nature of these data makes them particularly sensitive, and access to the micro-data remains restricted. One approach to balancing the tension between confidentiality protection and access is the generation of synthetic data. The process for generating such data begins by estimating a posterior predictive distribution (PPD) of the to-be-released data given the confidential micro-data. The next step is to draw samples from the PPD to produce the released micro-data. The quality of inferences based on a wide variety of models applied to synthetic and actual data has been inadequately assessed to date because only a limited number of users have had access to both data sources. This kind of assessment needs to be integrated within a quality-feedback loop in order to improve synthetic data and increase the use of the data by the research community. This award facilitates such a feedback loop for synthetic versions of two datasets: the Census Bureau's Survey of Income and Program Participation and the Longitudinal Business Database. The goal is to broaden access to the data, enhance the feedback loop, and provide flexible and secure access to early releases of these synthetic data.
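The two-step process described above (estimate a PPD from the confidential data, then draw the released records from it) can be illustrated with a deliberately small sketch. It assumes a single numeric variable, a normal model, and a noninformative prior. None of these assumptions come from the abstract, and the function name `synthesize` and the simulated "confidential" values are hypothetical stand-ins. It is not the method used for the SIPP or LBD synthetic files.

```python
import random
import statistics


def synthesize(confidential, n_synthetic, rng):
    """Draw synthetic values from the PPD of a normal model.

    Assumption (not from the abstract): values are i.i.d. normal with
    unknown mean and variance under the standard noninformative prior.
    """
    n = len(confidential)
    ybar = statistics.fmean(confidential)
    s2 = statistics.variance(confidential)  # sample variance, n - 1 denominator

    # Step 1: draw the parameters from their posterior.
    # sigma^2 | y  ~  (n - 1) * s^2 / chi-square(n - 1)
    # A chi-square with k degrees of freedom is Gamma(shape=k/2, scale=2).
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)
    sigma2 = (n - 1) * s2 / chi2
    # mu | sigma^2, y  ~  Normal(ybar, sigma^2 / n)
    mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)

    # Step 2: draw the released records given the drawn parameters.
    return [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_synthetic)]


rng = random.Random(0)
# Hypothetical stand-in for confidential micro-data (simulated, not real).
confidential = [rng.gauss(50.0, 10.0) for _ in range(500)]
synthetic = synthesize(confidential, 500, rng)

# A crude version of the quality check the abstract calls for:
# compare the same estimate on the confidential and synthetic data.
print(statistics.fmean(confidential), statistics.fmean(synthetic))
```

In this toy setting the comparison in the last line plays the role of the feedback loop: a user who can run the same analysis on both files can report how far the synthetic-data estimate falls from the confidential-data estimate.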

A variety of social scientists from a range of disciplines will be able to use this data access method and will provide detailed input that will guide future improvements in data quality.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Social and Economic Sciences (SES)
Type: Standard Grant (Standard)
Application #: 1042181
Program Officer: Nancy Lutz

Project Start
Project End
Budget Start: 2010-09-15
Budget End: 2015-08-31
Support Year
Fiscal Year: 2010
Total Cost: $252,465
Indirect Cost

Institution: Cornell University, Ithaca, NY, United States
