Advances in technology, bioinformatics, and data science have made it possible to analyze large and complex databases to generate evidence that improves public health and accelerates the development of precision medicine. However, the advent of big data has also raised concerns about privacy and confidentiality. This application focuses on data privacy in vertically partitioned data, a data environment in which information about an individual is available in two or more data sources. This type of data structure is common in biomedical research and is expected to grow exponentially as information from the same individual is increasingly collected in multiple sources, such as insurance claims databases, electronic health records, registries, social media, wearables, and mobile devices. Combining multiple databases provides a more complete health profile of the patient and generates more robust evidence. However, concerns about data privacy, confidentiality, and security, together with constraints in governance and institutional agreements, make it highly challenging or sometimes impossible to physically pool different data sources. We propose to develop an open-source, freely available software tool that will employ a cutting-edge method, distributed regression, to analyze vertically partitioned datasets. The method does not require data to be combined physically, but produces results statistically equivalent to those that would be obtained if the datasets were linked and pooled centrally at one site. Instead of sharing patient-level information, participating sites will transfer only a non-identifiable information matrix (a design matrix used in fitting statistical models) and other summary-level statistics needed in the statistical modeling process. This approach offers much greater protection for data privacy while still allowing sophisticated statistical analysis.
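To illustrate the claim that summary-level exchange can reproduce the pooled analysis, the following is a minimal sketch for ordinary linear regression only. It is not the proposed tool and makes no claim about the project's actual algorithm. It uses a secure-matrix-product idea described in the privacy-preserving statistics literature, in which one site masks its columns by projecting away a random subspace before sending them. The site names, variable names, simulated data, and the choice of g are all hypothetical, and rows are assumed to be already linked and identically ordered at both sites.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of linked individuals (assumed aligned row by row at both sites)

# Hypothetical site A holds an intercept and two covariates.
XA = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
# Hypothetical site B holds two other covariates and the outcome y.
XB = rng.normal(size=(n, 2))
y = (XA @ np.array([1.0, 0.5, -0.3])
     + XB @ np.array([0.8, -1.2])
     + rng.normal(scale=0.5, size=n))

def orthogonal_mask(X, g, rng):
    """Return g orthonormal columns Z with Z.T @ X == 0 (X assumed full column rank)."""
    Q, _ = np.linalg.qr(X, mode="complete")   # n x n orthogonal matrix
    complement = Q[:, X.shape[1]:]            # basis orthogonal to the columns of X
    R, _ = np.linalg.qr(rng.normal(size=(complement.shape[1], g)))
    return complement @ R                     # random g-dimensional subspace of it

# Step 1 (site A): build the mask Z and send it to site B.
g = n // 2  # assumed privacy/utility trade-off, chosen here only for illustration
Z = orthogonal_mask(XA, g, rng)

# Step 2 (site B): remove the Z-component from its columns and send W back.
MB = np.column_stack([XB, y])
W = MB - Z @ (Z.T @ MB)

# Step 3 (site A): because XA.T @ Z == 0, XA.T @ W equals XA.T @ [XB, y].
cross = XA.T @ W
C_AB, XAty = cross[:, :-1], cross[:, -1]

# Step 4: assemble the normal equations from summary-level pieces only.
XtX = np.block([[XA.T @ XA, C_AB],
                [C_AB.T, XB.T @ XB]])
Xty = np.concatenate([XAty, XB.T @ y])
beta_dist = np.linalg.solve(XtX, Xty)

# Reference fit on physically pooled data, for comparison only.
beta_pooled = np.linalg.lstsq(np.column_stack([XA, XB]), y, rcond=None)[0]
```

In this sketch the distributed and pooled coefficient vectors agree up to floating-point error. The masked matrix W is still row-level, so how much it discloses depends on g; a production tool would need a formal assessment of that leakage, and generalized linear models would require an iterative version of the exchange.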
The software tool will be developed, tested, and fine-tuned using both simulated datasets and real-world data from Optum Labs, which houses one of the largest vertically partitioned datasets in the U.S., with claims and electronic health record data from over 5 million patients. The tool will be made compatible with PopMedNet™, an open-source data-sharing platform currently used by several large national initiatives such as the NIH Health Care Systems Research Collaboratory Distributed Research Network, the PCORI-funded National Patient-Centered Clinical Research Network (PCORnet), and the FDA-funded Sentinel program. The tool is therefore highly scalable and can have an immediate impact on real-world big data analysis. The multidisciplinary study team includes researchers who pioneered some of the distributed regression approaches and experts who have extensive experience in multi-center studies. The distributed regression method has great potential to shift the paradigm of multi-center biomedical big data research from transferring potentially identifiable patient-level data to sharing non-identifiable summary-level information. The proposed software tool will be a major step towards real-world application of this state-of-the-art privacy-protecting analytic approach.
The proposed project will develop a software tool that allows users to perform multivariable-adjusted regression analysis using information collected at two different data sources without physically combining the datasets. Instead of sharing potentially identifiable patient-level information, the tool only requires sharing of non-identifiable summary-level information across sites. The tool has great potential to help protect data privacy and patient confidentiality.
Yoshida, Kazuki; Gruber, Susan; Fireman, Bruce H et al. (2018) Comparison of privacy-protecting analytic and data-sharing methods: A simulation study. Pharmacoepidemiol Drug Saf 27:1034-1041
Wong, Jenna; Horwitz, Mara Murray; Zhou, Li et al. (2018) Using machine learning to identify health outcomes from electronic health record data. Curr Epidemiol Rep 5:331-342
Connolly, John G; Wang, Shirley V; Fuller, Candace C et al. (2017) Development and application of two semi-automated tools for targeted medical product surveillance in a distributed data network. Curr Epidemiol Rep 4:298-306
Toh, Sengwee (2017) Pharmacoepidemiology in the era of real-world evidence. Curr Epidemiol Rep 4:262-265
Li, Xiaojuan; Young, Jessica G; Toh, Sengwee (2017) Estimating Effects of Dynamic Treatment Strategies in Pharmacoepidemiologic Studies with Time-varying Confounding: A Primer. Curr Epidemiol Rep 4:288-297