Deep tensor genomic imputation

Noble, William

Abstract

High-throughput sequencing assays allow scientists to measure biochemical properties like transcription factor binding, histone modifications, and gene expression in nearly any cell line or primary tissue ("biosample"). Unfortunately, measuring all possible biochemical properties in every biosample is infeasible, both because of limited sample availability and because the cost would be prohibitive. We have previously developed a state-of-the-art imputation method, called Avocado, that can fill in the holes in such data sets. Avocado couples tensor factorization with a deep neural network. The method is scalable to large data sets and provides more accurate imputations than competing methods such as ChromImpute or PREDICTD. We have already applied Avocado systematically to the NIH ENCODE data set and made the imputations publicly available via the ENCODE web portal. Here, we propose to extend Avocado in four important ways. First, we will extend Avocado to handle single-cell data sets, thereby effectively turning each single-cell experiment into an in silico co-assay that measures multiple properties of each cell in parallel. Second, we will extend Avocado to work with data such as Hi-C, which measures three-dimensional properties of DNA. The extension involves converting Avocado's 3D tensor (biosample × assay × genomic position) to a 4D tensor with two genomic position axes. This extension will apply to a wide variety of data types, including various types of Hi-C data, SPRITE, GAM, ChIA-PET and PLAC-seq. Third, we will enhance Avocado to use variant-aware genomic sequence to enable high-resolution imputation of regulatory profiles. Finally, we will leverage the imputed data to infer cis-regulatory sequence annotations and the molecular impact of regulatory non-coding variants in one of the most comprehensive collections of cellular contexts.
All of the software produced by this project will be open source, and all of the imputed data and latent factorizations will be made publicly available via the web portals associated with the NIH 4D Nucleome and ENCODE Consortia, providing a valuable public resource for users of these data sets.
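
To make the tensor framing concrete, the following is a minimal sketch of the general idea of coupling a tensor factorization with a neural network. It is not Avocado's actual architecture or code: the sizes, the single embedding table per axis, the untrained two-layer network, and the names `predict_3d` and `predict_4d` are all assumptions chosen for illustration. Each axis of the tensor gets a table of latent factors, and a small network maps the concatenated factors for one cell of the tensor to a predicted signal value. The 4D variant simply looks up the genomic position factors twice, once for each position axis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_biosamples, n_assays, n_positions, k = 4, 3, 50, 8

# One table of latent factors per tensor axis (the "factorization" part).
biosample_emb = rng.normal(size=(n_biosamples, k))
assay_emb = rng.normal(size=(n_assays, k))
position_emb = rng.normal(size=(n_positions, k))

# A tiny, untrained two-layer network standing in for the deep model.
hidden = 16
W1_3d = rng.normal(size=(3 * k, hidden)) * 0.1  # input: three factor vectors
W1_4d = rng.normal(size=(4 * k, hidden)) * 0.1  # input: four factor vectors
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, 1)) * 0.1
b2 = np.zeros(1)


def _forward(x, W1):
    """Apply one ReLU hidden layer and a linear output layer."""
    h = np.maximum(0.0, x @ W1 + b1)
    return float((h @ W2 + b2)[0])


def predict_3d(b, a, p):
    """Predict the signal at (biosample b, assay a, genomic position p)."""
    x = np.concatenate([biosample_emb[b], assay_emb[a], position_emb[p]])
    return _forward(x, W1_3d)


def predict_4d(b, a, p, q):
    """Predict a pairwise signal (e.g. a contact) between positions p and q."""
    x = np.concatenate(
        [biosample_emb[b], assay_emb[a], position_emb[p], position_emb[q]]
    )
    return _forward(x, W1_4d)


# Imputation means evaluating the model at every cell of the tensor,
# including cells for which no experiment was performed.
imputed = np.array(
    [
        [[predict_3d(b, a, p) for p in range(n_positions)] for a in range(n_assays)]
        for b in range(n_biosamples)
    ]
)
```

In a trained model, the factor tables and network weights would be fit jointly to the observed experiments; here they are random, so the values in `imputed` are meaningless and only the shapes and data flow are informative.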

Public Health Relevance

High-throughput sequencing can be used to measure many types of biochemical activity along the genome in a huge variety of primary cell types and cell lines ("biosamples"), but it is prohibitively expensive to measure all possible types of activity in all possible biosamples. Accordingly, we have developed a powerful machine learning approach, called Avocado, to predict such measurements before they are performed. Here, we propose to increase Avocado's utility by extending the method to work with single-cell data types, to work with 3D genome architecture data, to incorporate information about DNA sequence, and to decode regulatory DNA sequence and non-coding genetic variation.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG011466-01
Application #: 10096947
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Gilchrist, Daniel A

Project Start: 2021-02-01
Project End: 2025-01-31
Budget Start: 2021-02-01
Budget End: 2022-01-31
Support Year: 1
Fiscal Year: 2021

Deep tensor genomic imputation
Noble, William Stafford
University of Washington, Seattle, WA, United States
