When the Human Genome Project was completed almost ten years ago, it cost millions of dollars to sequence an individual's genome. Yet the evolution of high-throughput sequencing and computational tools has been swift, and it will soon be possible to genotype anyone for a nominal price. The ability to generate genomic data coincides with the adoption of electronic health records, setting the stage for large-scale personalized medicine research, the results of which can improve the efficiency, effectiveness, and safety of healthcare delivery. To ease barriers to population-based research, various policies and regulations allow genomic and clinical data to be made available under a de-identified designation. However, there is a growing perception that de-identification is a fallacy and that biomedical data can be re-identified with relative ease. This argument, which is partially based on our own studies, forms the core of calls for legislative and regulatory modifications in the literature and court cases. Most notably, a recent Advance Notice of Proposed Rulemaking (ANPRM) asks whether biospecimens, as well as derived genomic data, should be redefined as inherently identifiable. Such labeling would require changes to the Common Rule and HIPAA Privacy Rule and could influence the availability of genomic data for research. It is clear that only a small amount of genomic data is necessary to uniquely distinguish an individual, even in the context of aggregated statistics. At the same time, it must be recognized that "distinguishable" is not equivalent to "identifiable": although re-identification is possible, that does not imply it is probable. Identifiability concerns should not be trivialized, but there is currently no sound basis for reasoning about such risks, which limits the ability to make informed policy decisions.
There are many factors associated with identifiability, including the information shared with genomic data (e.g., clinical, demographic), with whom it is shared, what other sources of data exist, and the relevant legal landscape. A limiting factor of prior studies in genomic identifiability is that they consider these factors in isolation, which provides an incomplete picture. To fill this void, the overarching objective of our research is to engineer a foundation, rooted in ethical, legal, and computational formalisms, that provides a basis for reasoning about, and managing, genomic data identifiability risks. This foundation will be realized through three specific aims: (1) build a protocol for modeling the extent to which sharing genomic data can substantiate re-identification concerns, (2) design and evaluate practical measures of genomic identifiability for risk assessment protocols, and (3) develop a strategy that supplies options to mitigate genomic data identification risks. We envision several notable outcomes from this project. First, this work will yield guidelines and risk assessment strategies that genomic data managers and policy makers can use to inform their decisions regarding identifiability. Second, we will evaluate our framework with a real, large de-identified database of clinical and genomic data to provide tangible and pragmatic results.
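
The distinction drawn above between "distinguishable" and "identifiable" can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration and is not the project's method: the toy genotype tuples, the record IDs, the names, and the helper functions `distinguishable` and `identifiable`. A record counts as distinguishable when its genotype tuple is unique within the shared dataset, and as identifiable only when that unique tuple also matches exactly one entry in some named external resource.

```python
from collections import Counter

# Hypothetical toy data: each de-identified research record is a tuple of
# genotypes at three SNPs (0, 1, or 2 copies of the minor allele).
research_records = {
    "r1": (0, 1, 2),
    "r2": (0, 1, 2),
    "r3": (1, 1, 0),
    "r4": (2, 0, 1),
}

# Hypothetical identified resource an adversary might hold (name -> genotypes).
identified_resource = {
    "Alice": (1, 1, 0),
    "Bob": (0, 0, 0),
}


def distinguishable(records):
    """Return the IDs of records whose genotype tuple is unique in the dataset."""
    counts = Counter(records.values())
    return {rid for rid, genotype in records.items() if counts[genotype] == 1}


def identifiable(records, named):
    """Return {record ID: name} for unique records matching exactly one named genotype."""
    name_counts = Counter(named.values())
    # Keep only named genotypes that point to a single person.
    by_genotype = {g: name for name, g in named.items() if name_counts[g] == 1}
    return {
        rid: by_genotype[records[rid]]
        for rid in distinguishable(records)
        if records[rid] in by_genotype
    }


print(sorted(distinguishable(research_records)))
print(identifiable(research_records, identified_resource))
```

In this toy dataset, records r3 and r4 are both distinguishable, but only r3 can be linked to a name, because no identified resource in the sketch contains r4's genotypes. A realistic risk model would also need to account for the other factors named above, such as who receives the data and the legal landscape.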

Public Health Relevance

The protective nature of de-identification has been criticized, and there are growing calls to relabel all genomic data as inherently identifiable. However, there are no reasoning tools to help genomic data managers and policy makers assess identifiability or determine which protections, technical or legal, should be invoked to mitigate risks. The goal of this research project is to develop an interdisciplinary framework to a) model genomic data re-identification risks, b) measure the risks given computational and socio-legal constraints, and c) assist in determining which data protection strategies are most appropriate for specific data sharing scenarios.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006844-02
Application #: 8548389
Study Section: Special Emphasis Panel (SEIR)
Program Officer: McEwen, Jean
Project Start: 2012-09-21
Project End: 2016-06-30
Budget Start: 2013-07-01
Budget End: 2014-06-30
Support Year: 2
Fiscal Year: 2013
Total Cost: $334,251
Indirect Cost: $92,185

Name: Vanderbilt University Medical Center
Department: Internal Medicine/Medicine
Type: Schools of Medicine
DUNS #: 004413456
City: Nashville
State: TN
Country: United States
Zip Code: 37212

Altman, Russ B; Clayton, Ellen Wright; Kohane, Isaac S et al. (2013) Data re-identification: societal safeguards. Science 339:1032-3
Hazin, Ribhi; Brothers, Kyle B; Malin, Bradley A et al. (2013) Ethical, legal, and social implications of incorporating genomic information into electronic health records. Genet Med 15:810-6