With the surge of large genomics data, there is an immense increase in the breadth and depth of different omics datasets and an increasing importance in the topic of privacy of individuals in genomic data science. Detailed genetic and environmental characterization of diseases and conditions relies on the large-scale mining of functional genomics data; hence, there is great desire to share data as broadly as possible. However, there is a scarcity of privacy studies focused on such data. A key first step in reducing private information leakage is to measure the amount of information leakage in functional genomics data, particularly in different data file types. To this end, we propose to to derive information-theoretic measures for private information leakage in different data types from functional genomics data. We will also develop various file formats to reduce this leakage during sharing. We will approach the privacy analysis under three aims. First, we will develop statistical metrics that can be used to quantify the sensitive information leakage from raw reads. We will systematically analyze how linking attacks can be instantiated using various genotyping methods such as single nucleotide variant and structural variant calling from raw reads, signal profiles, Hi-C interaction matrices, and gene expression matrices. Second, we will study different algorithms to implement privacy-preserving transformations to the functional genomics data in various forms. Particularly, we will create privacy-preserving file formats for raw sequence alignment maps, signal track files, three-dimensional interaction matrices, and gene expression quantification matrices that contain information from multiple individuals. This will allow us to study the sources of sensitive information leakages other than raw reads, for example signal profiles, splicing and isoform transcription, and abnormal three-dimensional genomic interactions. Third, we will investigate the reads that can be mapped to the microbiome in the raw human functional genomics datasets. We will use inferred microbial information to characterize private information about individuals, and then combine the microbial information with the information from human mapped reads to increase the re-identification accuracy in the linking attacks described in the second aim. We will use the tools to quantify the sensitive information and privacy-preserving file formats in the available datasets from large sequencing projects, such as the ENCODE, The Cancer Genome Atlas, 1,000 Genomes, gEUVADIS, and Genotype-Tissue Expression projects.

Public Health Relevance

Sharing large-scale functional genomics data is critical for scientific discovery, but comes with important privacy concerns related to the possible misuse of such data. This proposal will quantify and manage the rieslkasted to releasing functional genomics datasets, based on integrating inferred genotypes from the raw sequence files, signal tracks, and microbiome mapped sequences. Finally, we will develop file formats, statistical methodologies, and related software for anonymization of functional genomics data that enable open sharing.

Agency
National Institute of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG010749-01A1
Application #
9970939
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Sofia, Heidi J
Project Start
2020-09-02
Project End
2025-06-30
Budget Start
2020-09-02
Budget End
2021-06-30
Support Year
1
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Yale University
Department
Biochemistry
Type
Schools of Medicine
DUNS #
043207562
City
New Haven
State
CT
Country
United States
Zip Code
06520