Distributed health data networks (DHDNs) that leverage electronic health records (EHRs) (e.g., eMerge, pSCANNER, PEDSnet) have drawn substantial interest in recent years, as they a) eliminate the need to create, maintain, and secure access to central data repositories, b) minimize the need to disclose protected health information outside the data-owning entity, and c) mitigate many security, proprietary, legal, and privacy concerns. Missing data are ubiquitous and present analytical challenges in DHDNs. However, very limited research has been conducted to address missing data in such settings. When applied to a distributed environment, current state-of-the-art approaches for handling missing data require pooling raw data into a central repository before analysis and hence require individual-level data sharing, which may not be feasible for a number of reasons, including institutional policies prohibiting such sharing, high regulatory hurdles, public privacy concerns, and the costs and overhead of moving massive amounts of data. A large body of research has demonstrated that, given some background information about an individual such as data from EHRs, an adversary can learn sensitive information about the individual from "de-identified" data, and that improper disclosure of individual-level data may have serious implications. The proposed research will address the challenges associated with handling missing data in distributed analysis and fill a crucial methodology gap.
We propose the following specific aims: 1) develop privacy-preserving distributed methods for handling missing data in horizontally partitioned data; 2) develop privacy-preserving distributed methods for handling missing data in vertically partitioned data; 3) develop a user-friendly toolkit to allow researchers to handle missing data for distributed analysis in health data networks; and 4) evaluate and validate the methods and toolkit using the UCSD obesity patient data prepared for pSCANNER, data from PEDSnet, and simulated data. The proposed approaches will enable the use of data across multiple sites and will not require pooling patient-level data into a central repository. They can be scaled up to handle massive amounts of data in DHDNs, because the decomposed computation can be parallelized across all participating parties. The results of our study will significantly advance the state of the art in missing data methodology for DHDNs. The privacy-preserving software toolkit will enable researchers to use more complete data in their research by leveraging information from multiple sites without compromising patient privacy, help lower regulatory and other hurdles to collaboration across multiple institutions, and build public trust. As such, it will encourage more institutions and healthcare systems to become part of a clinical data research network and more patients to participate in clinical studies, which will improve the validity, robustness, and generalizability of research findings and offer substantial benefits in areas including, but not limited to, precision medicine and informatics practice.
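To make the idea of decomposed computation on horizontally partitioned data concrete, the following is a minimal sketch and not the method proposed in this project. It assumes each site holds the same variable for different patients, and it swaps in simple mean imputation purely to show the communication pattern: every site shares only an aggregate (count and sum of its observed values), a coordinator combines the aggregates, and each site fills in its own missing entries locally. All function names and the toy data are hypothetical, and sharing aggregates alone does not amount to a formal privacy guarantee.

```python
from typing import List, Optional, Tuple

# One variable at one site; None marks a missing value (hypothetical layout).
Site = List[Optional[float]]


def local_summary(values: Site) -> Tuple[int, float]:
    """Runs at a site: return (count, sum) of the observed values only."""
    observed = [v for v in values if v is not None]
    return len(observed), sum(observed)


def global_mean(summaries: List[Tuple[int, float]]) -> float:
    """Runs at the coordinator: combine site-level aggregates into a pooled mean."""
    total_n = sum(n for n, _ in summaries)
    total_sum = sum(s for _, s in summaries)
    if total_n == 0:
        raise ValueError("no observed values at any site")
    return total_sum / total_n


def impute_locally(values: Site, fill: float) -> List[float]:
    """Runs at a site: replace missing entries with the pooled mean."""
    return [fill if v is None else v for v in values]


# Toy data for three hypothetical sites; patient-level values never leave a site.
site_a: Site = [1.0, None, 3.0]
site_b: Site = [None, 5.0]
site_c: Site = [7.0, None, None, 4.0]
sites = [site_a, site_b, site_c]

# Each site's summary can be computed independently, hence in parallel.
summaries = [local_summary(s) for s in sites]
mean = global_mean(summaries)
completed = [impute_locally(s, mean) for s in sites]
```

The pooled mean obtained this way equals the mean that would result from centralizing the observed values, which is the sense in which the computation decomposes across sites. Mean imputation itself is a deliberately naive stand-in; the same pattern of exchanging aggregates rather than records is what a more principled distributed method would build on.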

Public Health Relevance

The goal of this study is to develop privacy-preserving distributed methods and tools for handling missing data in distributed health data networks. The proposed research will enable researchers to use more complete data in their research by leveraging information from multiple sites without compromising patient privacy, help lower regulatory and other hurdles to collaboration across multiple institutions, and build public trust. It will improve the validity, robustness, and generalizability of research findings, and offer substantial benefits in areas including, but not limited to, precision medicine and biomedical informatics practice.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM124111-02
Application #
9562117
Study Section
Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer
Brazhnik, Paul
Project Start
2017-09-08
Project End
2021-06-30
Budget Start
2018-07-01
Budget End
2019-06-30
Support Year
2
Fiscal Year
2018
Total Cost
Indirect Cost
Name
University of Pennsylvania
Department
Type
Schools of Medicine
DUNS #
042250712
City
Philadelphia
State
PA
Country
United States
Zip Code
19104
Chen, Luyao; Aziz, Md Momin; Mohammed, Noman et al. (2018) Secure large-scale genome data storage and query. Comput Methods Programs Biomed 165:129-137
Min, Eun Jeong; Safo, Sandra E; Long, Qi (2018) Penalized Co-Inertia Analysis with Applications to -Omics Data. Bioinformatics :