Privacy is receiving much attention with the unprecedented increase in the breadth and depth of biomedical datasets, particularly personal genomics datasets. Most studies on genomic privacy focus on the protection of variants in personal genomes. Molecular phenotype datasets, however, can also contain a substantial amount of sensitive information. Although they contain no explicit genotypic information, subtle genotype-phenotype correlations can be used to statistically link phenotype and genotype datasets. We will study methodologies for analyzing sensitive information leakage from phenotype datasets. We will focus on RNA-seq datasets and the associated sources of sensitive information leakage. These leakages are mediated by expression quantitative trait loci (eQTLs). We will approach the privacy analysis under three aims. In the first aim, we will propose statistical metrics for quantifying sensitive information leakage from phenotype datasets. These quantifications can be used to evaluate the risks of privacy breaches. In the second aim, we will focus on the systematic analysis of how linking attacks can be instantiated and analyzed. We will study how linking attacks can be generalized so that privacy researchers can study the associated risks more systematically. We will then evaluate different models of genotype prediction and assess how they can be used in linking attacks. We will focus, specifically, on outlier gene expression levels and evaluate how the outliers can be used for genotype prediction and in linking attacks. In the third aim, we will develop tools that implement the quantification, risk estimation, and risk management methodologies and integrate them into a coherent software suite for comprehensive privacy analysis, which will enable the protection of RNA-seq datasets at different levels of summarization, e.g., reads and gene and transcript quantifications.
We aim to increase the number of software tools available for genomic privacy analysis. We will study different algorithmic approaches to tackle the high computational complexity of the anonymization techniques in the literature. We will study sources of sensitive information leakage other than gene expression levels, e.g., splicing and non-coding transcription. These sources will be studied in the context of the risk quantification and management strategies presented in the previous aims. Finally, we will use the tools to quantify the sensitive information in publicly available datasets from large sequencing projects, for example ENCODE, 1000 Genomes, TCGA, GEUVADIS, and GTEx.
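
As a rough illustration of the kind of linking attack described above, the following is a minimal sketch on synthetic data. It predicts genotypes from outlier (extreme) expression levels at known eQTLs and then matches each expression sample to the closest record in a genotype dataset. All function names, the simulated effect size, and the rank-based extremity rule are assumptions made for illustration; this is not the project's implementation.

```python
import numpy as np


def simulate(n_individuals=100, n_eqtls=200, effect=2.0, seed=0):
    """Synthetic cohort: genotypes in {0, 1, 2} and eQTL-driven expression (toy model)."""
    rng = np.random.default_rng(seed)
    genotypes = rng.integers(0, 3, size=(n_individuals, n_eqtls))
    # Direction of each eQTL effect: +1 if the alternate allele raises expression.
    signs = rng.choice([-1.0, 1.0], size=n_eqtls)
    noise = rng.normal(0.0, 1.0, size=(n_individuals, n_eqtls))
    expression = signs * effect * (genotypes - 1) + noise
    return genotypes, expression, signs


def predict_genotypes(expression, signs):
    """Predict a homozygous genotype (0 or 2) from how extreme each expression level is."""
    n = expression.shape[0]
    # Rank of each individual within each gene, rescaled to roughly [-0.5, 0.5).
    ranks = expression.argsort(axis=0).argsort(axis=0)
    extremity = (ranks + 0.5) / n - 0.5
    # High expression with a positive eQTL (or low with a negative one) suggests genotype 2.
    return np.where(extremity * signs > 0, 2, 0)


def link(predicted, genotype_db):
    """For each expression sample, return the index of the nearest genotype record (L1 distance)."""
    dist = np.abs(predicted[:, None, :] - genotype_db[None, :, :]).sum(axis=2)
    return dist.argmin(axis=1)


def linking_accuracy(effect=2.0, seed=0):
    """Fraction of expression samples linked to the correct record in a shuffled genotype dataset."""
    genotypes, expression, signs = simulate(effect=effect, seed=seed)
    rng = np.random.default_rng(seed + 1)
    perm = rng.permutation(genotypes.shape[0])
    genotype_db = genotypes[perm]      # attacker's genotype dataset, in shuffled order
    true_index = np.argsort(perm)      # position of each individual in that dataset
    linked = link(predict_genotypes(expression, signs), genotype_db)
    return float((linked == true_index).mean())


if __name__ == "__main__":
    print("linking accuracy with strong eQTLs:", linking_accuracy(effect=2.0))
    print("linking accuracy with no eQTL effect:", linking_accuracy(effect=0.0))
```

In this toy setting, the fraction of correctly linked individuals serves as a simple stand-in for the risk estimate; comparing the run with and without an eQTL effect shows how much of the linkage depends on the genotype-expression correlation.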

Public Health Relevance

We plan to study genomic privacy with a focus on the quantification and management of risks related to releasing RNA-seq datasets. We will study linking attacks in which an individual's privacy can be compromised by linking genotype and gene expression datasets, mediated by the use of expression quantitative trait loci (eQTLs) for genotype prediction. We will develop statistical methodologies and related software tools for the anonymization of gene expression datasets.

Agency
National Institutes of Health (NIH)
Institute
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
Type
Research Project--Cooperative Agreements (U01)
Project #
5U01EB023686-03
Application #
9527827
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Peng, Grace
Project Start
2016-09-23
Project End
2019-06-30
Budget Start
2018-07-01
Budget End
2019-06-30
Support Year
3
Fiscal Year
2018
Total Cost
Indirect Cost
Name
Yale University
Department
Biochemistry
Type
Schools of Arts and Sciences
DUNS #
043207562
City
New Haven
State
CT
Country
United States
Zip Code
Greenbaum, Dov; Rozowsky, Joel; Stodden, Victoria et al. (2017) Structuring supplemental materials in support of reproducibility. Genome Biol 18:64
Harmanci, Arif; Gerstein, Mark (2016) Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat Methods 13:251-6