Data science holds the promise of enabling new pathways to discovery and can improve the understanding, prevention and treatment of complex disorders such as cancer, diabetes, substance abuse, etc., which are significantly on the rise. The promise of data science can be fully realized only when collected data can be collaboratively shared and analyzed. However, the widespread increases in healthcare data breaches due to inappropriate access as well as the increasing number of novel privacy attacks restrict institutions from sharing data. Indeed, in some cases, the results of the analysis can themselves lead to significant privacy harm. The success of the data commons depends on ensuring the maximal access to data, subject to all of the patient privacy requirements including those mandated by legislation, and all of the constraints of the organization collecting the data itself. While there are existing solutions that can solve parts of the problem, there are significant challenges in truly incorporating these into comprehensive working solutions that are usable by the biomedical research community, and new challenges brought on by modern techniques such as deep learning. The long-term goal of this research is to develop technologies that can holistically enable data sharing while respecting privacy and security considerations and to ensure that they are implemented in existing platforms that have widespread acceptance in the research community. Towards this, the objective of this project is to develop complementary solutions for risk inference, distributed learning, and access control that can enable different modalities of data sharing. The problems studied are general in nature and will evolve depending on research successes and new impediments that arise. The proposed program of research is significant since lack of access to biomedical data can lead to fragmentation of care, resulting in higher economic and social costs, and is a significant impediment to biomedical research. The project will result in open-source, freely available software tools that will be integrated into widely used data collection, cohort identification, and distributed analytics platforms. There are several ongoing collaborations that will serve as initial pilot customers to provide use cases, identify the requirements, evaluate results, and in general validate the developed solutions.
Statement of Relevance to Public Health Being able to ensure privacy and security while enabling data sharing and analysis is critical to pave the way forward for public health research and improve our understanding of diseases. The proposed work will address the challenges that impede the use of data across all of the different modalities of data sharing. The integration into existing platforms will ensure that the developed models, tools, and solutions directly impact the research community and improve public health interventions.