With the surge of large genomics data, there is an immense increase in the breadth and depth of different omics datasets and an increasing importance in the topic of privacy of individuals in genomic data science. Detailed genetic and environmental characterization of diseases and conditions relies on the large-scale mining of functional genomics data; hence, there is great desire to share data as broadly as possible. However, there is a scarcity of privacy studies focused on such data. A key first step in reducing private information leakage is to measure the amount of information leakage in functional genomics data, particularly in different data file types. To this end, we propose to to derive information-theoretic measures for private information leakage in different data types from functional genomics data. We will also develop various file formats to reduce this leakage during sharing. We will approach the privacy analysis under three aims. First, we will develop statistical metrics that can be used to quantify the sensitive information leakage from raw reads. We will systematically analyze how linking attacks can be instantiated using various genotyping methods such as single nucleotide variant and structural variant calling from raw reads, signal profiles, Hi-C interaction matrices, and gene expression matrices. Second, we will study different algorithms to implement privacy-preserving transformations to the functional genomics data in various forms. Particularly, we will create privacy-preserving file formats for raw sequence alignment maps, signal track files, three-dimensional interaction matrices, and gene expression quantification matrices that contain information from multiple individuals. This will allow us to study the sources of sensitive information leakages other than raw reads, for example signal profiles, splicing and isoform transcription, and abnormal three-dimensional genomic interactions. Third, we will investigate the reads that can be mapped to the microbiome in the raw human functional genomics datasets. We will use inferred microbial information to characterize private information about individuals, and then combine the microbial information with the information from human mapped reads to increase the re-identification accuracy in the linking attacks described in the second aim. We will use the tools to quantify the sensitive information and privacy-preserving file formats in the available datasets from large sequencing projects, such as the ENCODE, The Cancer Genome Atlas, 1,000 Genomes, gEUVADIS, and Genotype-Tissue Expression projects.

Public Health Relevance

Sharing large-scale functional genomics data is critical for scientific discovery, but comes with important privacy concerns related to the possible misuse of such data. This proposal will quantify and manage the rieslkasted to releasing functional genomics datasets, based on integrating inferred genotypes from the raw sequence files, signal tracks, and microbiome mapped sequences. Finally, we will develop file formats, statistical methodologies, and related software for anonymization of functional genomics data that enable open sharing.

National Institute of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Project #
Application #
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Sofia, Heidi J
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Yale University
Schools of Medicine
New Haven
United States
Zip Code