Synthetic Data Generation for Small Area Estimation

Sakshaug, Joe

Abstract

Sample surveys are a crucial source of information about the state of public health and people's quality of life. Moreover, they provide an efficient way to identify and monitor illness and disability trends and track progress toward achieving CDC's Health Protection Goals. Increasingly, this information is being demanded in the form of small area statistics to monitor health trends and support policy decisions in small geographic areas, including those that are typically underrepresented in large-scale data collection projects. However, the CDC is often prevented from releasing small area identifiers in public-use datasets because the data do not satisfy certain disclosure restrictions described in the Public Health Service Act (Section 308(d)), which forbids the disclosure of any information that may compromise the confidentiality promised to its survey respondents. This dissertation research tests and evaluates a new method for generating public-use micro-level datasets that contain enough geographical detail to permit small area estimation without compromising the confidentiality of survey respondents. The method uses the observed survey data to fit a statistical imputation model that generates artificial, or synthetic, data records, which comprise the public-use data records. The synthetic data are generated to emulate the observed data and preserve their important statistical properties. Moreover, the synthetic data can account for the hierarchical clustering structure associated with multiple levels of geography, thus permitting data users to perform various geographical analyses with a single dataset. Confidentiality protection is greatly enhanced because no actual data values are released to the public. The proposed methodology will be tested and evaluated using data from the National Health Interview Survey (NHIS) and the Behavioral Risk Factor Surveillance System (BRFSS). Synthetic versions of these data sources will be generated for key variables relevant to national health objectives. Various parametric and non-parametric imputation models capable of handling different variable types will be investigated. All of this work will be conducted at the Michigan Census Research Data Center at the University of Michigan.
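
As a rough illustration of the general idea described in the abstract (fit a model to the observed records, then release draws from the model instead of the records themselves), the sketch below may help. It is not the method proposed in this project: it assumes a single continuous variable, a simple two-level normal model with area effects, and plug-in estimates rather than draws from a posterior distribution. The function names `fit_area_model` and `synthesize`, the area labels, and the simulated "observed" values are all hypothetical.

```python
import random
import statistics
from collections import defaultdict


def fit_area_model(records):
    """Fit a two-level normal model to (area, value) pairs using plug-in estimates."""
    by_area = defaultdict(list)
    for area, value in records:
        by_area[area].append(value)

    overall = statistics.fmean([value for _, value in records])
    means = {a: statistics.fmean(v) for a, v in by_area.items()}

    # Pooled within-area variance; single-record areas add no degrees of freedom.
    ss = sum((x - means[a]) ** 2 for a, v in by_area.items() for x in v)
    df = sum(len(v) - 1 for v in by_area.values())
    within = ss / df if df > 0 else 0.0

    # Crude moment estimate of the between-area variance, floored at zero.
    avg_n = statistics.fmean([len(v) for v in by_area.values()])
    between = max(statistics.pvariance(list(means.values())) - within / avg_n, 0.0)

    # Shrink each area mean toward the overall mean; small areas shrink more.
    shrunk = {}
    for a, v in by_area.items():
        denom = between + within / len(v)
        weight = between / denom if denom > 0 else 0.0
        shrunk[a] = overall + weight * (means[a] - overall)
    return {"area_means": shrunk, "within_sd": within ** 0.5}


def synthesize(model, sizes, rng):
    """Draw sizes[area] synthetic values per area from the fitted model."""
    return [
        (area, rng.gauss(model["area_means"][area], model["within_sd"]))
        for area, n in sizes.items()
        for _ in range(n)
    ]


rng = random.Random(0)
# Hypothetical "observed" survey values for two small areas (simulated, not real data).
observed = [("A", rng.gauss(10, 2)) for _ in range(200)]
observed += [("B", rng.gauss(50, 2)) for _ in range(200)]

model = fit_area_model(observed)
# Only these model-generated records would be released, never the observed ones.
synthetic = synthesize(model, {"A": 200, "B": 200}, rng)
```

In this sketch the area labels are kept in the synthetic records, so a data user could still compute area-level summaries. A fuller treatment would also propagate uncertainty in the model parameters and handle categorical variables and additional levels of geography.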

Public Health Relevance

The proposed research aims to maximize access to, and increase the utility of, public health survey data for identifying and monitoring health trends in small geographic areas, including those that are typically underrepresented in large-scale data collection projects. Without robust survey data at small geographic levels, public health researchers and other health care professionals cannot successfully answer pressing public health questions that affect small areas and local communities. The potential impact of this research is an increase in the volume of small area statistics produced and used to assess community-level health care needs and improve the quality of life in places where people live and work.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Center for Health Statistics (NCHS)
Type: Dissertation Award (R36)
Project #: 1R36SH000016-01
Application #: 7769003
Study Section: Special Emphasis Panel (ZCD1-AWI (10))

Project Start: 2009-09-30
Project End: 2011-09-29
Budget Start: 2009-09-30
Budget End: 2011-09-29
Support Year: 1
Fiscal Year: 2009
Total Cost: $37,800

Sakshaug, Joe Walter
University of Michigan Ann Arbor, Ann Arbor, MI, United States