When the Human Genome Project was completed almost ten years ago, it cost millions of dollars to sequence an individual's genome. Yet the evolution of high-throughput sequencing and computational tools has been swift, and it will soon be possible to genotype anyone for a nominal price. The ability to generate genomic data coincides with the adoption of electronic health records, setting the stage for large-scale personalized medicine research, the results of which can improve the efficiency, effectiveness, and safety of healthcare delivery. To ease barriers to population-based research, genomic and clinical data are often made available under a de-identified designation pursuant to various policies and regulations. However, there is a growing perception that de-identification is a fallacy and that biomedical data can be re-identified with relative ease. This argument, which is partially based on our own studies, forms the core of calls for legislative and regulatory modification in the literature and in court cases. Most notably, a recent Advance Notice of Proposed Rulemaking (ANPRM) asks whether biospecimens, as well as derived genomic data, should be redefined as inherently identifiable. Such a labeling would require changes to the Common Rule and the HIPAA Privacy Rule and could influence the availability of genomic data for research. It is clear that only a small amount of genomic data is necessary to uniquely distinguish an individual, even in the context of aggregated statistics. At the same time, it must be recognized that "distinguishable" is not equivalent to "identifiable", and although re-identification is possible, that does not imply it is probable. Identifiability concerns should not be trivialized, but there is currently no sound basis for reasoning about such risks, which limits the ability to make informed policy decisions.
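The claim that a small amount of genomic data suffices to distinguish an individual can be illustrated with a back-of-the-envelope calculation. The sketch below, which is not part of the project itself, assumes independent SNPs in Hardy-Weinberg equilibrium with an illustrative minor allele frequency, and counts how many such SNPs are needed before the probability that two random people share a genotype profile falls below roughly one in the world population.

```python
# Illustrative sketch only: independence of SNPs and the allele frequency
# are simplifying assumptions, not data from this project.

def genotype_match_prob(maf):
    """Probability that two random individuals share a genotype at one SNP,
    under Hardy-Weinberg equilibrium with minor allele frequency `maf`."""
    p, q = 1.0 - maf, maf
    freqs = [p * p, 2 * p * q, q * q]  # genotypes AA, Aa, aa
    return sum(f * f for f in freqs)

def snps_needed(target=1.0 / 8e9, maf=0.3):
    """Number of independent SNPs before the expected match probability
    drops below `target` (default: about one in the world population)."""
    per_snp = genotype_match_prob(maf)
    prob, n = 1.0, 0
    while prob > target:
        prob *= per_snp
        n += 1
    return n
```

Under these toy assumptions, a few dozen common SNPs already make a profile essentially unique, which is consistent with the point that distinguishability arises quickly, while saying nothing about whether the distinguished record can actually be linked to a named person.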
There are many factors associated with identifiability, including the information shared with genomic data (e.g., clinical, demographic), with whom it is shared, what other sources of data exist, and the relevant legal landscape. A limiting factor of prior studies in genomic identifiability is that they consider these factors in isolation, which provides an incomplete picture. To fill this void, the overarching objective of our research is to engineer a foundation, rooted in ethical, legal, and computational formalisms, that provides a basis for reasoning about, and managing, genomic data identifiability risks. This foundation will be realized through three specific aims: (1) build a protocol for modeling the extent to which sharing genomic data can substantiate re-identification concerns, (2) design and evaluate practical measures of genomic identifiability for risk assessment protocols, and (3) develop a strategy that supplies options to mitigate genomic data re-identification risks. We envision several notable outcomes from this project. First, this work will yield guidelines and risk assessment strategies that can be employed by genomic data managers and policy makers to inform their decisions regarding identifiability. Second, we will evaluate our framework against a real, large de-identified database of clinical and genomic data to provide tangible and pragmatic results.
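To make the notion of a practical identifiability measure concrete, the sketch below computes a prosecutor-style re-identification risk from quasi-identifier group sizes, one of the simplest measures used in risk assessment. The field names and toy records are hypothetical, chosen only for illustration; they are not the project's data or its actual metric.

```python
# Hypothetical sketch of one simple identifiability measure: each record's
# risk is 1/k, where k is the size of its equivalence class (the set of
# records sharing its quasi-identifier values).
from collections import Counter

def reidentification_risks(records, quasi_identifiers):
    """Return (worst-case risk, average risk) over all records."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    risks = [1.0 / sizes[k] for k in keys]
    return max(risks), sum(risks) / len(risks)

# Toy example: the third record is unique on (age, zip, sex), so its
# risk is 1.0; the first two share a class of size 2, so each is 0.5.
records = [
    {"age": 34, "zip": "37203", "sex": "F"},
    {"age": 34, "zip": "37203", "sex": "F"},
    {"age": 61, "zip": "37203", "sex": "M"},
]
worst, avg = reidentification_risks(records, ["age", "zip", "sex"])
```

Measures of this family depend only on the released data; the project's broader point is that realistic risk also depends on the recipient, auxiliary data sources, and legal deterrents, which is what motivates a framework combining computational and socio-legal constraints.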
The protective nature of de-identification has been criticized, and there are growing calls to relabel all genomic data as inherently identifiable. However, there are no reasoning tools to assist genomic data managers and policy makers in assessing identifiability or determining which protections, technical or legal, should be invoked to mitigate risks. The goals of this research project are to develop an interdisciplinary framework to (a) model genomic data re-identification risks, (b) measure the risks given computational and socio-legal constraints, and (c) assist in determining which data protection strategies are the most appropriate for specific data sharing scenarios.
Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun et al. (2017) Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE Trans Knowl Data Eng 29:698-711
Wan, Zhiyu; Vorobeychik, Yevgeniy; Xia, Weiyi et al. (2017) Expanding Access to Large-Scale Genomic Data While Promoting Privacy: A Game Theoretic Approach. Am J Hum Genet 100:316-322
Heatherly, Raymond; Rasmussen, Luke V; Peissig, Peggy L et al. (2016) A multi-institution evaluation of clinical profile anonymization. J Am Med Inform Assoc 23:e131-7
Xia, Weiyi; Heatherly, Raymond; Ding, Xiaofeng et al. (2015) R-U policy frontiers for health data de-identification. J Am Med Inform Assoc 22:1029-41
El Emam, Khaled; Rodgers, Sam; Malin, Bradley (2015) Anonymising and sharing individual patient data. BMJ 350:h1139
Naveed, Muhammad; Ayday, Erman; Clayton, Ellen W et al. (2015) Privacy in the Genomic Era. ACM Comput Surv 48:
Barth-Jones, Daniel; El Emam, Khaled; Bambauer, Jane et al. (2015) Assessing data intrusion threats. Science 348:194-5
Wan, Zhiyu; Vorobeychik, Yevgeniy; Xia, Weiyi et al. (2015) A game theoretic framework for analyzing re-identification risk. PLoS One 10:e0120592
Heatherly, Raymond; Denny, Joshua C; Haines, Jonathan L et al. (2014) Size matters: how population size influences genotype-phenotype association studies in anonymized data. J Biomed Inform 52:243-50
Heatherly, Raymond D; Loukides, Grigorios; Denny, Joshua C et al. (2013) Enabling genomic-phenomic association discovery without sacrificing anonymity. PLoS One 8:e53875
Showing the most recent 10 out of 13 publications