When the Human Genome Project was completed almost ten years ago, it cost millions of dollars to sequence an individual's genome. Yet the evolution of high-throughput sequencing and computational tools has been swift, and it will soon be possible to genotype anyone for a nominal price. The ability to generate genomic data coincides with the adoption of electronic health records, setting the stage for large-scale personalized medicine research, the results of which can improve the efficiency, effectiveness, and safety of healthcare delivery. To ease barriers to population-based research, genomic and clinical data are often made available under a de-identified designation established by various policies and regulations. However, there is a growing perception that de-identification is a fallacy and that biomedical data can be re-identified with relative ease. This argument, which is partially based on our own studies, forms the core of calls for legislative and regulatory modifications in the literature and court cases. Most notably, a recent Advance Notice of Proposed Rulemaking (ANPRM) inquires if biospecimens, as well as derived genomic data, should be redefined as inherently identifiable. Such labeling would require changes to the Common Rule and HIPAA Privacy Rule and could influence the availability of genomic data for research. It is clear that only a small amount of genomic data is necessary to uniquely distinguish an individual, even in the context of aggregated statistics. At the same time, it must be recognized that "distinguishable" is not equivalent to "identifiable", and though re-identification is possible, this does not imply that it is probable. Identifiability concerns should not be trivialized, but there is currently no sound basis for reasoning about such risks, which limits the ability to make informed policy decisions.
There are many factors associated with identifiability, including the information shared with genomic data (e.g., clinical, demographic), with whom it is shared, what other sources of data exist, and the relevant legal landscape. A limiting factor of prior studies in genomic identifiability is their consideration of these factors in isolation, which provides an incomplete picture. To fill this void, the overarching objective of our research is to engineer a foundation, rooted in ethical, legal, and computational formalisms, that provides a basis for reasoning about, and managing, genomic data identifiability risks. This foundation will be realized through three specific aims: (1) build a protocol for modeling the extent to which sharing genomic data can substantiate re-identification concerns, (2) design and evaluate practical measures of genomic identifiability for risk assessment protocols, and (3) develop a strategy that supplies options to mitigate genomic data identification risks. We envision several notable outcomes from this project. First, this work will yield guidelines and risk assessment strategies that can be employed by genomic data managers and policy makers to inform their decisions regarding identifiability. Second, we will perform an evaluation of our framework with a real, large de-identified database of clinical and genomic data to provide tangible and pragmatic results.

Public Health Relevance

The protective nature of de-identification has been criticized, and there are growing calls to relabel all genomic data as inherently identifiable. However, there are no reasoning tools to assist genomic data managers and policy makers in assessing identifiability or in determining which protections, technical or legal, should be invoked to mitigate risks. The goals of this research project are to develop an interdisciplinary framework to a) model genomic data re-identification risks, b) measure the risks given computational and socio-legal constraints, and c) assist in determining which data protection strategies are the most appropriate for specific data sharing scenarios.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Special Emphasis Panel (SEIR)
Program Officer
McEwen, Jean
Vanderbilt University Medical Center
Internal Medicine/Medicine
Schools of Medicine
United States
Xia, Weiyi; Wan, Zhiyu; Yin, Zhijun et al. (2018) It's all in the timing: calibrating temporal penalties for biomedical data sharing. J Am Med Inform Assoc 25:25-31
Wan, Zhiyu; Vorobeychik, Yevgeniy; Kantarcioglu, Murat et al. (2017) Controlling the signal: Practical privacy protection of genomic data sharing through Beacon services. BMC Med Genomics 10:39
Wang, Shuang; Jiang, Xiaoqian; Tang, Haixu et al. (2017) A community effort to protect genomic data sharing, collaboration and outsourcing. NPJ Genom Med 2:33
Prasser, Fabian; Gaupp, James; Wan, Zhiyu et al. (2017) An Open Source Tool for Game Theoretic Health Data De-Identification. AMIA Annu Symp Proc 2017:1430-1439
Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun et al. (2017) Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE Trans Knowl Data Eng 29:698-711
Wan, Zhiyu; Vorobeychik, Yevgeniy; Xia, Weiyi et al. (2017) Expanding Access to Large-Scale Genomic Data While Promoting Privacy: A Game Theoretic Approach. Am J Hum Genet 100:316-322
Yuan, Jiawei; Malin, Bradley; Modave, François et al. (2017) Towards a privacy preserving cohort discovery framework for clinical research networks. J Biomed Inform 66:42-51
Heatherly, Raymond; Rasmussen, Luke V; Peissig, Peggy L et al. (2016) A multi-institution evaluation of clinical profile anonymization. J Am Med Inform Assoc 23:e131-7
Xia, Weiyi; Heatherly, Raymond; Ding, Xiaofeng et al. (2015) R-U policy frontiers for health data de-identification. J Am Med Inform Assoc 22:1029-41
Naveed, Muhammad; Ayday, Erman; Clayton, Ellen W et al. (2015) Privacy in the Genomic Era. ACM Comput Surv 48:

Showing the most recent 10 out of 18 publications