Geographic data can be enormously beneficial for analyses. In studies of aging, for example, they can reveal areas where elderly people live in high densities; they can illuminate how environmental factors affect the health and quality of life of elderly people; and, through contextual data, they can yield insights into the social and economic conditions and lifestyle choices of the elderly. However, geographic variables are among the most challenging data to share when making a primary data source available to others. Fine geography enables ill-intentioned users to pinpoint the identities of individuals in the shared file. Thus, data collectors typically delete geographies or aggregate them to very high levels before sharing data. For example, both deletion and aggregation are employed on geography in the public use files of the Health and Retirement Study, and the Health Insurance Portability and Accountability Act requires that any geographic units on shared files comprise at least 20,000 people. These actions reduce the quality of analyses that depend on finer geographic detail, thereby sacrificing the benefits of using geography in analysis. We develop new methods to protect confidentiality in data with geographic identifiers. Our approach is to simulate values of geography and other identifying attributes, such as age, from statistical models that capture the spatial dependencies in the collected data. These simulated values replace the collected ones when sharing data. Partially simulated datasets can preserve confidentiality, since identification of units and their sensitive data is difficult when the geographies and other quasi-identifiers in the released data are not collected values. When the simulation models faithfully reflect the relationships in the collected data, the shared data also preserve spatial associations, avoid ecological inference problems, and provide details about the tails of distributions. We have three specific aims in this proposal.
First, using techniques from spatial modeling, we develop methods for simulating geographic variables conditional on attributes and for simulating attributes conditional on geography. Second, we apply our approach to a genuine dataset to evaluate the confidentiality protection and analytic utility of partially simulated data under three scenarios: only geography simulated, only non-geographic identifiers simulated, and both geographic and other identifiers simulated. Third, we compare our approach against aggregation techniques on the genuine dataset. Our long-term goal is to develop general-purpose methodology and publicly available software for sharing inference-valid, safe data that include finer details about geography than are currently released. This will provide statistical agencies, researchers, and other data producers with more and better options for data sharing than exist at present.
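
As a rough illustration of the first scenario above (only geography simulated), the following minimal Python sketch replaces each record's collected area with a draw from a model of area conditional on an attribute. It is a toy under stated assumptions, not the project's method: a smoothed empirical conditional distribution stands in for the spatial models the abstract describes, and the records, the field names (`age_group`, `area`, `income`), the smoothing parameter `alpha`, and both function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_geography_model(records, alpha=1.0):
    """Estimate P(area | age_group) from the collected records.

    Assumption: additive smoothing (alpha) over a simple contingency
    table stands in for a real spatial model.
    """
    areas = sorted({r["area"] for r in records})
    counts = defaultdict(Counter)
    for r in records:
        counts[r["age_group"]][r["area"]] += 1
    model = {}
    for group, c in counts.items():
        total = sum(c.values()) + alpha * len(areas)
        model[group] = [(c[a] + alpha) / total for a in areas]
    return areas, model


def synthesize_geography(records, areas, model, rng):
    """Return copies of the records with 'area' replaced by a model draw.

    All other fields are copied unchanged (partial synthesis).
    """
    synthetic = []
    for r in records:
        weights = model[r["age_group"]]
        new_area = rng.choices(areas, weights=weights, k=1)[0]
        synthetic.append({**r, "area": new_area})
    return synthetic


# Hypothetical collected data, invented for illustration only.
collected = [
    {"age_group": "65-74", "area": "A", "income": 31},
    {"age_group": "65-74", "area": "A", "income": 28},
    {"age_group": "65-74", "area": "B", "income": 40},
    {"age_group": "75+", "area": "B", "income": 22},
    {"age_group": "75+", "area": "C", "income": 19},
    {"age_group": "75+", "area": "C", "income": 25},
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
areas, model = fit_geography_model(collected)
released = synthesize_geography(collected, areas, model, rng)
```

Here `released` has the same records and non-geographic fields as `collected`, but every area value is simulated rather than collected. In practice one would likely release several such simulated datasets and use a model that captures spatial dependence, which this toy does not attempt.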

Public Health Relevance

This research has the potential to improve the way statistical agencies, research centers, individual researchers, and other data producers share data on aging and, more broadly, any health or demographic data containing geography. Unlike existing approaches such as deletion and high-level aggregation, our approach promises to preserve fine geography and spatial relationships while protecting confidentiality. Ultimately, this enables secondary data analysts to make more and better inferences, leading to deeper understanding of public health.

Agency
National Institutes of Health (NIH)
Institute
National Institute on Aging (NIA)
Type
Exploratory/Developmental Grants (R21)
Project #
5R21AG032458-02
Application #
7774323
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Bhattacharyya, Partha
Project Start
2009-03-01
Project End
2012-01-31
Budget Start
2010-02-15
Budget End
2012-01-31
Support Year
2
Fiscal Year
2010
Total Cost
$189,961
Indirect Cost
Name
Duke University
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
044387793
City
Durham
State
NC
Country
United States
Zip Code
27705
Paiva, Thais; Chakraborty, Avishek; Reiter, Jerry et al. (2014) Imputation of confidential data sets with spatial locations using disease mapping models. Stat Med 33:1928-45
Burgette, Lane F; Reiter, Jerome P (2013) Multiple-Shrinkage Multinomial Probit Models with Applications to Simulating Geographies in Public Use Data. Bayesian Anal 8:
Wang, Hao; Reiter, Jerome P (2012) Multiple imputation for sharing precise geographies in public use data. Ann Appl Stat 6:229-252
Manrique-Vallier, Daniel; Reiter, Jerome P (2012) Estimating Identification Disclosure Risk Using Mixed Membership Models. J Am Stat Assoc 107:1385-1394