High-throughput sequencing assays allow scientists to measure biochemical properties like transcription factor binding, histone modifications, and gene expression in nearly any cell line or primary tissue ("biosample"). Unfortunately, measuring all possible biochemical properties in every biosample is infeasible, both because of limited sample availability and because the cost would be prohibitive. We have previously developed a state-of-the-art imputation method, called Avocado, that can fill in the holes in such data sets. Avocado couples tensor factorization with a deep neural network. The method is scalable to large data sets and provides more accurate imputations than competing methods such as ChromImpute or PREDICTD. We have already applied Avocado systematically to the NIH ENCODE data set and made the imputations publicly available via the ENCODE web portal. Here, we propose to extend Avocado in four important ways. First, we will extend Avocado to handle single-cell data sets, thereby effectively turning each single-cell experiment into an in silico co-assay that measures multiple properties of each cell in parallel. Second, we will extend Avocado to work with data such as Hi-C, which measures three-dimensional properties of DNA. The extension involves converting Avocado's 3D tensor (biosample × assay × genomic position) to a 4D tensor with two genomic position axes. This extension will apply to a wide variety of data types, including various types of Hi-C data, SPRITE, GAM, ChIA-PET and PLAC-seq. Third, we will enhance Avocado to use variant-aware genomic sequence to enable high-resolution imputation of regulatory profiles. Finally, we will leverage the imputed data to infer cis-regulatory sequence annotations and the molecular impact of regulatory non-coding variants in one of the most comprehensive collections of cellular contexts.
All of the software produced by this project will be open source, and all of the imputed data and latent factorizations will be made publicly available via the web portals associated with the NIH 4D Nucleome and ENCODE Consortia, providing a valuable public resource for users of these data sets.
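
The idea of coupling tensor factorization with a neural network, and of moving from a 3D to a 4D tensor, can be illustrated with a minimal NumPy sketch. This is not Avocado's implementation: the sizes, the single hidden layer, the random untrained weights, and the names `init_mlp` and `predict` are all assumptions made for illustration. Each axis of the tensor gets a table of latent factors, and a small network maps the concatenated factors for one tensor cell to a predicted signal value. In the 4D case a second genomic position factor is simply added to the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_BIOSAMPLES, N_ASSAYS, N_POSITIONS = 4, 3, 10
DIM, HIDDEN = 5, 8

# One latent vector per index along each axis of the tensor.
biosample_f = rng.normal(size=(N_BIOSAMPLES, DIM))
assay_f = rng.normal(size=(N_ASSAYS, DIM))
position_f = rng.normal(size=(N_POSITIONS, DIM))


def init_mlp(n_inputs):
    """Random weights for a one-hidden-layer network (untrained)."""
    return {
        "W1": rng.normal(scale=0.1, size=(n_inputs, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(scale=0.1, size=(HIDDEN, 1)),
        "b2": np.zeros(1),
    }


def predict(mlp, *factors):
    """Concatenate the latent factors for one tensor cell and run the network."""
    x = np.concatenate(factors)
    h = np.maximum(0.0, x @ mlp["W1"] + mlp["b1"])  # ReLU hidden layer
    return float((h @ mlp["W2"] + mlp["b2"])[0])


# 3D case: (biosample, assay, genomic position) -> signal value.
mlp_3d = init_mlp(3 * DIM)
signal = predict(mlp_3d, biosample_f[1], assay_f[2], position_f[7])

# 4D case: two genomic position axes, e.g. a contact between loci 2 and 9.
mlp_4d = init_mlp(4 * DIM)
contact = predict(mlp_4d, biosample_f[1], assay_f[0], position_f[2], position_f[9])
```

In a trained model of this kind, the factor tables and network weights would be fit jointly to the observed tensor cells, and imputation would amount to calling `predict` on cells that were never measured.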

Public Health Relevance

High-throughput sequencing can be used to measure many types of biochemical activity along the genome in a huge variety of primary cell types and cell lines ("biosamples"), but it is prohibitively expensive to measure all possible types of activity in all possible biosamples. Accordingly, we have developed a powerful machine learning approach, Avocado, to predict such measurements before they are performed. Here, we propose to increase Avocado's utility by extending the method to work with single-cell data types, to work with 3D genome architecture data, to incorporate information about DNA sequence, and to decode regulatory DNA sequence and non-coding genetic variation.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Gilchrist, Daniel A
University of Washington
Schools of Medicine
United States