Privacy-preserving methods and tools for handling missing data in distributed health data networks

Long, Qi

Abstract

Distributed health data networks (DHDNs) that leverage electronic health records (EHRs) (e.g., eMerge, pSCANNER, PEDSnet) have drawn substantial interest in recent years, as they a) eliminate the need to create, maintain, and secure access to central data repositories, b) minimize the need to disclose protected health information outside the data-owning entity, and c) mitigate many security, proprietary, legal, and privacy concerns. Missing data are ubiquitous and present analytical challenges in DHDNs. However, very limited research has been conducted to address missing data in such settings. When applied to a distributed environment, the current state-of-the-art approaches for handling missing data require pooling raw data into a central repository before analysis and hence require individual-level data sharing, which may not be feasible for a number of reasons, including institutional policies prohibiting such sharing, high regulatory hurdles, public privacy concerns, and the costs and overhead of moving massive amounts of data. A large body of research has demonstrated that, given some background information about an individual such as data from EHRs, an adversary can learn sensitive information about the individual from "de-identified" data, and improper disclosure of individual-level data may have serious implications. The proposed research will address the challenges associated with handling missing data in distributed analysis and fill a crucial methodology gap.
We propose the following specific aims: 1) develop privacy-preserving distributed methods for handling missing data in horizontally partitioned data; 2) develop privacy-preserving distributed methods for handling missing data in vertically partitioned data; 3) develop a user-friendly toolkit to allow researchers to handle missing data for distributed analysis in health data networks; and 4) evaluate and validate the methods and toolkit using the UCSD obesity patient data prepared for pSCANNER and data from PEDSnet, in addition to simulated data. The proposed approaches will enable using data across multiple sites and will not require pooling patient-level data into a central repository. They can be scaled up to handle massive amounts of data in DHDNs, because the decomposed computation can be parallelized across all participating parties. The results of our study will significantly advance the state of the art in missing data methodology for DHDNs. The privacy-preserving software toolkit will enable researchers to use more complete data in their research by leveraging information from multiple sites without compromising patient privacy, and will help lower regulatory and other hurdles to collaboration across multiple institutions and build public trust. As such, it will encourage more institutions and healthcare systems to become part of a clinical data research network and more patients to participate in clinical studies, which will improve the validity, robustness, and generalizability of research findings and offer substantial benefits in areas including, but not limited to, precision medicine and informatics practice.
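
As a rough illustration of what "decomposed computation" can mean for horizontally partitioned data (each site holds different patients but the same variables), the sketch below has every site share only a count and a sum of its observed values, and a coordinator combine them into a pooled mean that each site then uses to fill in its own missing entries. This is not the method proposed in the grant: mean imputation is a deliberately simple stand-in, all function names and the toy data are hypothetical, and sharing aggregates alone does not amount to a formal privacy guarantee.

```python
from typing import List, Optional, Tuple

# One variable's values at one site; None marks a missing value (assumed encoding).
Site = List[Optional[float]]


def local_summary(values: Site) -> Tuple[int, float]:
    """Runs inside a site: count and sum of the observed values only."""
    observed = [v for v in values if v is not None]
    return len(observed), float(sum(observed))


def pooled_mean(summaries: List[Tuple[int, float]]) -> float:
    """Runs at the coordinator: combines site-level aggregates, never raw rows."""
    n = sum(count for count, _ in summaries)
    if n == 0:
        raise ValueError("no observed values at any site")
    return sum(total for _, total in summaries) / n


def impute_locally(values: Site, fill: float) -> List[float]:
    """Runs inside a site: replaces missing entries with the pooled mean."""
    return [fill if v is None else v for v in values]


# Hypothetical toy data for two sites.
site_a: Site = [1.0, None, 3.0]
site_b: Site = [None, 5.0]

summaries = [local_summary(s) for s in (site_a, site_b)]
mean = pooled_mean(summaries)
imputed_a = impute_locally(site_a, mean)
imputed_b = impute_locally(site_b, mean)
```

Because each site's summary is computed independently, the local steps can run in parallel, and only two numbers per site cross institutional boundaries.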

Public Health Relevance

The goal of this study is to develop privacy-preserving distributed methods and tools for handling missing data in distributed health data networks. The proposed research will enable researchers to use more complete data in their research by leveraging information from multiple sites without compromising patient privacy, and will help lower regulatory and other hurdles to collaboration across multiple institutions and build public trust. It will improve the validity, robustness, and generalizability of research findings, and offer substantial benefits in areas including, but not limited to, precision medicine and biomedical informatics practice.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM124111-02
Application #: 9562117
Study Section: Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer: Brazhnik, Paul

Project Start: 2017-09-08
Project End: 2021-06-30
Budget Start: 2018-07-01
Budget End: 2019-06-30
Support Year: 2
Fiscal Year: 2018

Institution

Name: University of Pennsylvania
Type: Schools of Medicine
DUNS #: 042250712

City: Philadelphia
State: PA
Country: United States
Zip Code: 19104

Related projects


NIH 2020 R01 GM	Privacy-preserving methods and tools for handling missing data in distributed health data networks Long, Qi / University of Pennsylvania
NIH 2019 R01 GM	Privacy-preserving methods and tools for handling missing data in distributed health data networks Long, Qi / University of Pennsylvania
NIH 2018 R01 GM	Privacy-preserving methods and tools for handling missing data in distributed health data networks Long, Qi / University of Pennsylvania
NIH 2017 R01 GM	Privacy-preserving methods and tools for handling missing data in distributed health data networks Long, Qi / University of Pennsylvania

Publications

Chen, Luyao; Aziz, Md Momin; Mohammed, Noman et al. (2018) Secure large-scale genome data storage and query. Comput Methods Programs Biomed 165:129-137

Min, Eun Jeong; Safo, Sandra E; Long, Qi (2018) Penalized Co-Inertia Analysis with Applications to -Omics Data. Bioinformatics :
