The past decade has witnessed numerous demonstrations that genomic data can be traced back to the corresponding named individuals. These attacks exploit various collections, including the NIH Database of Genotypes and Phenotypes (dbGaP), the 1000 Genomes Project, and the Beacon Project of the Global Alliance for Genomics and Health, and are often reported in the popular media. At the same time, research conducted in the first phase of this grant (2012-2016) showed that such re-identification attacks often represent worst-case, non-generalizable scenarios. Specifically, it showed that these demonstrations tend to focus on whether an attack is possible, not on how probable it is given the wide range of factors at play in practice. By focusing on the possible, such investigations can lead policy makers to believe that de-identification is a useless activity. However, our research showed that de-identification is only one part of a larger strategy of deterrents that can be used to manage risk. By intelligently combining de-identification with other technical risk mitigation approaches (e.g., controlled access) and societal constructs (e.g., data use agreements and penalties), genomic data sharing solutions can be developed with appropriate levels of risk and utility for scientists and society. While our research laid the foundation for managing identification risk in genomic data sharing, significant questions remain regarding its translation into practical guidance. In particular, risk management models must be specialized to the type of data that is shared, the types of penalties (or punishments) available, and the costs of adopting and administering deterrence mechanisms.
Thus, in the second phase of this research project, we propose to augment risk-based re-identification management frameworks to model and assess the deterrence approaches invoked by existing repositories, such as dbGaP (which holds a collection of smaller historical datasets from completed studies), as well as emerging initiatives, such as the Precision Medicine Initiative. This project will pursue three specific aims, designed to work in harmony, but at the same time sufficiently independent that if one fails, the research will still yield fruitful risk management guidance for genomic databases: 1) Develop game theoretic models to assess re-identification attacks at different levels of detail in genomic data sharing (e.g., aggregate summaries of the proportion of variants in case vs. control groups in association studies); 2) Characterize and measure the costs associated with common re-identification deterrence approaches for genomic data (e.g., physical investigatory reviews and virtual audits of IT system use); and 3) Optimize the parameterization of a deterrence policy (e.g., the amount of damages for violation of a data use agreement or the amount of time to withhold data from an attacker/investigator) given the expected value of genomic data. We will evaluate these approaches with a large repository of de-identified genomic and electronic medical records in use at a large academic medical center, datasets hosted at two federal repositories, and a web system that presents summary statistics from a cohort of 9000 participants.
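To make the three aims concrete, the sketch below shows one way a deterrence policy could be parameterized and optimized. It is a minimal illustration under stated assumptions, not the project's model: it assumes a simple leader-follower (Stackelberg-style) setup in which the data publisher commits to a penalty and an audit probability, and a single rational attacker then attacks only if the expected payoff is positive. All names (`Policy`, `attacker_payoff`, `publisher_payoff`, `best_policy`) and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Policy:
    """Hypothetical deterrence policy chosen by the data publisher."""
    penalty: float      # damages owed if a violation is detected
    audit_prob: float   # probability that an attack is detected by an audit


def attacker_payoff(policy, gain, success_prob, attack_cost):
    """Expected payoff to an attacker who attempts re-identification."""
    expected_gain = success_prob * gain
    expected_penalty = policy.audit_prob * policy.penalty
    return expected_gain - expected_penalty - attack_cost


def attacker_attacks(policy, gain, success_prob, attack_cost):
    """A rational attacker attacks only when the expected payoff is positive."""
    return attacker_payoff(policy, gain, success_prob, attack_cost) > 0


def publisher_payoff(policy, data_value, harm, audit_cost_rate,
                     gain, success_prob, attack_cost):
    """Value of sharing minus the cost of auditing and the expected harm."""
    admin_cost = audit_cost_rate * policy.audit_prob
    attacked = attacker_attacks(policy, gain, success_prob, attack_cost)
    expected_harm = success_prob * harm if attacked else 0.0
    return data_value - admin_cost - expected_harm


def best_policy(penalties, audit_probs, **params):
    """Grid search over candidate policies for the publisher's best choice."""
    candidates = [Policy(p, a) for p, a in product(penalties, audit_probs)]
    return max(candidates, key=lambda pol: publisher_payoff(pol, **params))


# Illustrative (made-up) parameters, in arbitrary units.
params = dict(data_value=100, harm=500, audit_cost_rate=40,
              gain=50, success_prob=0.2, attack_cost=2)
best = best_policy([0, 10, 50, 100], [0.0, 0.1, 0.25, 0.5], **params)
print(best, publisher_payoff(best, **params))
```

In this toy setting the publisher prefers the cheapest policy that still makes the attack unprofitable, which is the trade-off between deterrence cost and residual risk that the aims describe. A realistic model would also need to account for the level of data aggregation and the attacker's background knowledge.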

Public Health Relevance

Numerous demonstrations over the past decade have shown that purportedly anonymous genomic data can on occasion be linked to the individuals to whom they correspond. Yet, even if an attack is possible, it need not be probable, and emerging risk management approaches reveal that technical (e.g., data aggregation and auditing) and social (e.g., data use agreements and penalties) controls can be combined to lower risk to acceptable levels while maintaining scientific value of the data. To enable pragmatic socio-technical risk management, we propose to augment game theoretic models: 1) to quantify risk of re-identification attacks with varying levels of data aggregation and adversarial knowledge, 2) to measure the costs associated with managing deterrents to thwart attacks, and 3) to optimize the parameters of deterrence policies given the expected value of genomic data.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-SEIR-R (01)Q)
Program Officer: McEwen, Jean
Vanderbilt University Medical Center
United States
Xia, Weiyi; Wan, Zhiyu; Yin, Zhijun et al. (2018) It's all in the timing: calibrating temporal penalties for biomedical data sharing. J Am Med Inform Assoc 25:25-31
Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun et al. (2017) Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE Trans Knowl Data Eng 29:698-711
Wan, Zhiyu; Vorobeychik, Yevgeniy; Xia, Weiyi et al. (2017) Expanding Access to Large-Scale Genomic Data While Promoting Privacy: A Game Theoretic Approach. Am J Hum Genet 100:316-322
Yuan, Jiawei; Malin, Bradley; Modave, François et al. (2017) Towards a privacy preserving cohort discovery framework for clinical research networks. J Biomed Inform 66:42-51
Wan, Zhiyu; Vorobeychik, Yevgeniy; Kantarcioglu, Murat et al. (2017) Controlling the signal: Practical privacy protection of genomic data sharing through Beacon services. BMC Med Genomics 10:39
Wang, Shuang; Jiang, Xiaoqian; Tang, Haixu et al. (2017) A community effort to protect genomic data sharing, collaboration and outsourcing. NPJ Genom Med 2:33
Prasser, Fabian; Gaupp, James; Wan, Zhiyu et al. (2017) An Open Source Tool for Game Theoretic Health Data De-Identification. AMIA Annu Symp Proc 2017:1430-1439
Heatherly, Raymond; Rasmussen, Luke V; Peissig, Peggy L et al. (2016) A multi-institution evaluation of clinical profile anonymization. J Am Med Inform Assoc 23:e131-7
Xia, Weiyi; Heatherly, Raymond; Ding, Xiaofeng et al. (2015) R-U policy frontiers for health data de-identification. J Am Med Inform Assoc 22:1029-41
Naveed, Muhammad; Ayday, Erman; Clayton, Ellen W et al. (2015) Privacy in the Genomic Era. ACM Comput Surv 48:

Showing the most recent 10 out of 18 publications