Much of modern-day medicine is driven by genomic data, and genomic datasets are growing rapidly in size and complexity. Any use of human genomic data raises grave privacy concerns: the ability to query multiple genomic databases with seemingly innocuous questions such as "Do you contain any genome that has mutation X?" is enough to determine whether an individual's genome is present in the databases. Such re-identification attacks raise a germane question: can one implement privacy protection for genomic data so that meaningful data analysis remains possible, but attacks such as these become impossible? The main idea of this project is to achieve this goal by making and exploiting statistical assumptions about the data, such that if the assumptions are false, data analysis will suffer but privacy will not. The project will also generate curricular material for a graduate class at the intersection of data privacy, machine learning, and genomics.

The project considers three major research questions on preserving privacy in the context of genomic data. The notion of privacy used is differential privacy, which provably protects against re-identification attacks and has found large-scale adoption in both academia and industry. The first research question is the estimation of allele frequencies, and of linkage disequilibrium, while preserving individual privacy. Given a set of human genomes, the objective of allele frequency estimation is to estimate the frequency of the different mutations at various locations in the chromosome. Linkage disequilibrium is the deviation from independence for pairs of alleles. The second question is haplotype sampling. Haplotypes are sets of genetic variations (typically extending over multiple genes) that tend to be inherited together. In haplotype sampling, the objective is to generate synthetic haplotypes from a data set of human genomes while respecting the biology behind these genetic variations. Finally, the project aims to estimate pathogenic variants of breast cancer genes. Variants of the BRCA1 and BRCA2 genes are known to be pathogenic for breast cancer. However, many of the variants have not yet been classified as pathogenic or non-pathogenic and are Variants of Unknown Significance (VUSs). The objective is to develop a privacy-preserving system to gather statistics about the VUSs from individually sequenced genes.
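To make the first research question concrete, the following is a minimal sketch of differentially private allele frequency estimation using the standard Laplace mechanism. It is an illustration only and not the project's method, which the abstract does not specify. The function names, the 0/1/2 genotype encoding (copies of the alternate allele per diploid individual), and the choice to treat the number of individuals as public are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_allele_frequencies(genotypes, epsilon, rng=None):
    """Hypothetical helper: epsilon-DP allele frequencies via Laplace noise.

    genotypes: list of rows, one per individual; each entry is 0, 1, or 2
               (assumed encoding: copies of the alternate allele at a site).
    epsilon:   total privacy budget, shared across all sites.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not genotypes or not genotypes[0]:
        raise ValueError("genotypes must be a non-empty matrix")

    rng = rng or random.Random()
    n = len(genotypes)        # number of individuals (assumed public)
    m = len(genotypes[0])     # number of sites

    # Replacing one individual changes each per-site allele count by at
    # most 2, so the L1 sensitivity of the whole count vector is 2 * m.
    scale = (2.0 * m) / epsilon

    freqs = []
    for j in range(m):
        count = sum(row[j] for row in genotypes)
        noisy = count + laplace_noise(scale, rng)
        # Normalizing and clipping are post-processing; they do not
        # weaken the differential privacy guarantee.
        freqs.append(min(1.0, max(0.0, noisy / (2.0 * n))))
    return freqs


if __name__ == "__main__":
    data = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [0, 1, 1]]
    print(private_allele_frequencies(data, epsilon=1.0, rng=random.Random(0)))
```

Because the noise scale grows with the number of sites, this naive approach becomes inaccurate for genome-scale data, which is one reason additional statistical assumptions about the data, as the abstract describes, may be needed to keep the analysis useful.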

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1839317
Program Officer: Tracy Kimbrel
Budget Start: 2018-10-01
Budget End: 2021-09-30
Fiscal Year: 2018
Total Cost: $527,124

Name: University of California Santa Cruz
City: Santa Cruz
State: CA
Country: United States
Zip Code: 95064