The biomedical community is in the midst of a genomics revolution with the potential to personalize healthcare services to a patient's genome. To capitalize on recent genomics programs, scientists have initiated research to discover relationships between an individual's genomic variations and clinical phenotype. Most of the gathering and analysis of person-specific records has been localized to particular investigators or institutions;however, scientists need to share data collections to strengthen the statistical power of association tests, to allow others the opportunity to verify their analyses, and to comply with policy requirements. To facilitate this process, various organizations around the world are significantly investing in databanks to consolidate patient- specific records from disparate investigators. The availability of such databanks for wide-spread use is contingent on protecting the anonymity of the individuals that correspond to the shared records. Though policy and technical approaches for biomedical records privacy exist, they are inappropriate for environments that consolidate records from multiple organizations. In particular, various investigations demonstrate that the simple de-identification of person-specific biomedical records leave centralized records vulnerable to """"""""re- identification"""""""" through public resources. The overarching goal of our research is to develop a novel data protection model for centralized person-specific biomedical records based on formal privacy and security methods. Our solution will be composed of a suite of technologies, each of which addresses a challenge for the construction, and use, of biomedical databanks. These technologies will be developed in three specific aims: (1) build a tool to integrate research participants'biomedical records from disparate organizations without compromising participants'anonymity, (2) construct methods to securely collect, store, and analyze biomedical data without revealing individual records, and (3) detect and prevent policy violations that can arise as a consequence of investigators queries to the databank. Our methods will be implemented in software that shields scientists and administrators from handling the technical details of unfamiliar privacy and security protocols. The final product will be software that enables disparate data holders to submit information to a centralized biomedical databank, scientists to analyze the stored records, and administrators to monitor the use of system for privacy violations. The software will be designed in a modular and configurable manner, thus enabling users to pick and choose which protection features are most appropriate to their environment. To demonstrate the applicability of our methodology, this research will specifically address a real world data privacy challenge that is a bottleneck for multi-institutional genome wide association studies, but the resulting models and software will be reusable for other centralized databanking environments. We believe that by managing biomedical records through formal privacy protection mechanisms, databanks based on our model will be able to support research with greater throughput than the current status quo.

Public Health Relevance

To ensure wide-scale sharing of person-specific biomedical records for research purposes, it is necessary to build technologies that uphold the anonymity of the corresponding individuals without diminishing the usability of the records. In this research, we will focus on developing technologies to support privacy in an emerging phenomenon: biomedical databanks that centralize records from disparate data providers. The goals of this project are to develop techniques, implemented in open-source software, that a) merge biomedical records on the same subjects without revealing the subjects'identities, b) collect, store, and analyze biomedical records in a secure manner and without revealing individual records, and c) detect and mitigate policy violations while investigators interact with records stored in the databank.

National Institute of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Project #
Application #
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Ye, Jane
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Vanderbilt University Medical Center
Internal Medicine/Medicine
Schools of Medicine
United States
Zip Code
Yuan, Jiawei; Malin, Bradley; Modave, Fran├žois et al. (2017) Towards a privacy preserving cohort discovery framework for clinical research networks. J Biomed Inform 66:42-51
Li, Bo; Vorobeychik, Yevgeniy; Li, Muqun et al. (2017) Scalable Iterative Classification for Sanitizing Large-Scale Datasets. IEEE Trans Knowl Data Eng 29:698-711
Wan, Zhiyu; Vorobeychik, Yevgeniy; Xia, Weiyi et al. (2017) Expanding Access to Large-Scale Genomic Data While Promoting Privacy: A Game Theoretic Approach. Am J Hum Genet 100:316-322
Li, Muqun; Carrell, David; Aberdeen, John et al. (2016) Optimizing annotation resources for natural language de-identification via a game theoretic framework. J Biomed Inform 61:97-109
Heatherly, Raymond; Rasmussen, Luke V; Peissig, Peggy L et al. (2016) A multi-institution evaluation of clinical profile anonymization. J Am Med Inform Assoc 23:e131-7
Barth-Jones, Daniel; El Emam, Khaled; Bambauer, Jane et al. (2015) Assessing data intrusion threats. Science 348:194-5
Wan, Zhiyu; Vorobeychik, Yevgeniy; Xia, Weiyi et al. (2015) A game theoretic framework for analyzing re-identification risk. PLoS One 10:e0120592
Kho, Abel N; Cashy, John P; Jackson, Kathryn L et al. (2015) Design and implementation of a privacy preserving electronic health record linkage tool in Chicago. J Am Med Inform Assoc 22:1072-80
Xia, Weiyi; Heatherly, Raymond; Ding, Xiaofeng et al. (2015) R-U policy frontiers for health data de-identification. J Am Med Inform Assoc 22:1029-41
El Emam, Khaled; Rodgers, Sam; Malin, Bradley (2015) Anonymising and sharing individual patient data. BMJ 350:h1139

Showing the most recent 10 out of 43 publications