This CAREER award will develop methods for generating high-quality, spatially referenced public-use data while addressing data confidentiality concerns. Access to high-quality public-use data is critical for many research disciplines. However, analyses of fine-scale geographic regions with small population sizes (e.g., census tracts) often yield statistically unreliable inference. Small areas also may contain few study participants, which increases the risk of disclosing sensitive information about a participant, such as an individual's disease or employment status. This project will create a unifying framework between the formal privacy literature and the spatial statistics literature that gives equal weight to privacy considerations and the utility of the resulting data. The results of this research will be of value to both academic researchers and staff at the Federal statistical agencies. The investigator will collaborate with researchers at the Centers for Disease Control and Prevention and the National Center for Health Statistics. The investigator will develop workshops and short courses on spatial statistics and data privacy for staff at the Federal statistical agencies. The project also will create undergraduate research opportunities in Bayesian inference and statistical computing and provide educational opportunities related to spatial statistics and data privacy.

This project will develop Bayesian statistical methods for generating spatially referenced synthetic data that achieve or exceed the privacy protections currently implemented by U.S. Federal statistical agencies. Small area estimation methods from the spatial statistics literature provide a framework to leverage complex dependencies in the data to improve the precision of an estimate. Emerging methods from the data privacy literature may be used to mask or otherwise conceal information from these areas to uphold the privacy guarantees made to the data subjects in exchange for their participation. Taken together, these two approaches present an analytic tension between the need to provide accurate and reliable local estimates and the need to obscure detailed linkage between small area estimates and the data subjects residing therein. This project will address two issues. First, the project will devise a statistical framework for producing massive, differentially private public-use data repositories composed of spatially referenced synthetic aggregate count data. A key aspect of this work will be to strike a balance between computational efficiency and data utility. Second, the project will establish criteria under which synthetic data from a broad class of spatial models satisfy formal privacy protections. The result of this work will be methods that provide substantial gains in utility and help combine the tasks of data analysis and the generation of synthetic data to avoid redundancies.
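
To make the tension between privacy and utility concrete, the sketch below applies the standard Laplace mechanism from the differential privacy literature to area-level counts. It is only an illustrative baseline, not the Bayesian synthetic-data method the project proposes. The function names, the tract labels, the counts, and the choice of epsilon are all hypothetical, and the privacy claim assumes each person contributes to exactly one area count and that neighboring datasets differ by adding or removing one person (so the sensitivity is 1).

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_counts(counts, epsilon, seed=None):
    """Add Laplace(1/epsilon) noise to each area count.

    Assumption: one person affects exactly one count by 1 (sensitivity 1),
    under add/remove-one-person neighboring datasets.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    return {area: c + laplace_noise(scale, rng) for area, c in counts.items()}


# Hypothetical tract-level counts: one small area, one mid-sized, one large.
tract_counts = {"tract_A": 3, "tract_B": 12, "tract_C": 250}
noisy = privatize_counts(tract_counts, epsilon=0.5, seed=1)

for area, true_count in tract_counts.items():
    print(area, true_count, round(noisy[area], 1))
```

With epsilon = 0.5 the noise scale is 2, which is small relative to a count of 250 but comparable to a count of 3. That is the small-area problem in miniature: the same privacy guarantee costs far more utility where counts are sparse.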

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Social and Economic Sciences (SES)
Application #: 1943730
Program Officer: Cheryl Eavey
Project Start:
Project End:
Budget Start: 2020-05-01
Budget End: 2025-04-30
Support Year:
Fiscal Year: 2019
Total Cost: $250,000
Indirect Cost:
Name: Drexel University
Department:
Type:
DUNS #:
City: Philadelphia
State: PA
Country: United States
Zip Code: 19102