With the surge of large-scale genomic data, the breadth and depth of genomics datasets have grown immensely, and the privacy of individuals has become an increasingly important topic in genomic data science. Detailed genetic and environmental characterization of diseases and conditions relies on the large-scale mining of genotype-phenotype relationships; hence, there is a strong desire to share data as broadly as possible. The recent change in NIH policy on sharing genomic summary results is a major step toward making the data available to a broader set of researchers. However, privacy studies on inferring study participation have not kept pace with technological advances in genome sequencing. A key first step in reducing private information leakage is to measure the amount of leakage, particularly under different scenarios. To this end, we propose to derive information-theoretic measures of private information leakage in different genomic data sharing scenarios, especially when the datasets are noisy and incomplete. We will also develop risk assessment tools. We will approach the privacy analysis under three aims. First, we will develop statistical metrics that quantify sensitive information leakage in different data sharing scenarios, including conditions in which the genotype data are imperfect. We will systematically analyze the risk of inferring a patient's study participation. Second, we will design a plausible privacy attack through an experimental study in which different technologies will be used to sequence genomes from trace amounts of sample, such as touched objects or used glasses. This will allow us to study plausible scenarios of surreptitious DNA testing and their effect on genomic data sharing. Third, we will develop risk assessment tools for sharing genomic summary results.
These tools will simulate hundreds of scenarios learned through simulations in aim 1 and real-life privacy attacks in aim 2 to quantify the risks before the release of the data. These tools will be implemented using cryptographic techniques to further reduce private information leakage during the risk assessment step. During the K99 phase, the aim of this project is to find the minimum amount of genotyping information required and the maximum amount of noise tolerated for detection of a genome in a mixture, using simulations and wet-lab experiments. To accomplish this research goal, the K99 phase will involve training in molecular biology, genomics, and privacy. This training will take place at Yale University in the Department of Molecular Biophysics and Biochemistry, under the mentorship of Dr. Mark Gerstein (genomics and privacy) and Dr. Andrew Miranker (molecular biology). Building on the K99 training, the R00 phase will simulate the experimental results to increase the sample size, build privacy risk assessment tools from what is learned in the experiments and simulations, and implement these tools using cryptographic techniques.

Public Health Relevance

The only way to significantly increase participation in large-scale genomic studies, and thereby further biomedical research and human health, is to protect patient privacy. The ability to accomplish this depends on developing computational techniques that quantify privacy risk and provide alternative ways to share data while maximizing utility. The proposed research combines novel statistical and molecular biology techniques to 1) quantify the amount of private information leakage in genomic summary results, and 2) develop tools to assess privacy risk and means to share data while balancing usability and privacy.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Career Transition Award (K99)
Project #: 1K99HG010909-01A1
Application #: 10053985
Study Section: National Human Genome Research Institute Initial Review Group (GNOM)
Program Officer: Sofia, Heidi J
Project Start: 2020-08-05
Project End: 2022-07-31
Budget Start: 2020-08-05
Budget End: 2021-07-31
Support Year: 1
Fiscal Year: 2020
Total Cost:
Indirect Cost:

Name: Yale University
Department: Biochemistry
Type: Schools of Medicine
DUNS #: 043207562
City: New Haven
State: CT
Country: United States
Zip Code: 06520