Quantitative Definition of Cell Identity by Integrating Transcriptomic, Epigenomic, and Spatial Features of Individual Cells

Abstract

Defining the molecular features that identify the myriad specialized subsets of cells within the human body is foundational to a genomic approach to medicine. High-throughput single-cell sequencing has recently opened the door to comprehensively characterizing the molecular identities of human cells. Multiple types of features contribute to cell identity, including gene expression, epigenomic modifications, and spatial location within a tissue, but it is not currently possible to simultaneously measure all of these modalities within the same single cells. Each experimental context and measurement modality provides a different glimpse into cellular identity, and how to combine these views into a unified picture of cell identity remains unclear. Computational integration of multiple single cell experiments performed on different individual cells provides a way forward despite these challenges. However, existing approaches are not sufficiently robust to integrate single cell data across the full range of biological contexts, nor flexible enough to leverage the unique properties of different single cell modalities, and require recalculating results each time new data points arrive. We recently developed LIGER, a highly robust and flexible algorithm that can integrate single cell data sharing a common set of gene-centric features across a wide range of biological contexts and modalities. A key property of our approach is the ability to identify both shared and dataset-specific features that define cell types across biological contexts. Additionally, LIGER is built upon a powerful matrix factorization framework that is readily extensible.
In preliminary analysis, we showed that our approach can identify cell-type-specific sexually dimorphic gene expression and human subject variation, map cell types across species, and jointly define cell types from multiple single cell modalities that share corresponding features. Here, we build upon LIGER in several ways to develop a comprehensive framework that can most effectively leverage the unique aspects of transcriptomic, epigenomic, and spatial data for quantitative definition of cell identity. First, we develop an "online learning" algorithm that readily scales to millions of cells and can continually incorporate new data, allowing iterative refinement of cell identity (Aim 1). Second, we develop novel approaches to integrate single-cell modalities that assay different types of features (such as genes and intergenic peaks) and contain missing data (as in spatial transcriptomic datasets), enabling inference of epigenomic regulation and cross-modal data imputation (Aim 2). In collaboration with biomedical scientists, we apply our approach to newly generated single cell transcriptomic and single cell epigenomic data from mouse skeletal stem cells and experimentally validate the predicted linkage between these data modalities (Aim 3). Our work addresses a crucial gap in analysis methods for single cell genomic data and paves the way for a quantitative definition of cell identity that incorporates multiple types of cellular features.
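
As a rough illustration of what a matrix factorization with shared and dataset-specific features can look like, the sketch below factorizes each nonnegative cells-by-genes matrix X_i as H_i (W + V_i), where W is shared across datasets and V_i is specific to dataset i. It is an assumption-laden toy, not the LIGER implementation: the penalized objective, the multiplicative update rules, and the names `inmf`, `inmf_objective`, `lam`, and `k` are all choices made here for illustration, and the input data are random.

```python
import numpy as np


def inmf_objective(Xs, Hs, W, Vs, lam):
    """Sum over datasets of ||X_i - H_i (W + V_i)||_F^2 + lam * ||H_i V_i||_F^2."""
    total = 0.0
    for X, H, V in zip(Xs, Hs, Vs):
        total += np.linalg.norm(X - H @ (W + V)) ** 2
        total += lam * np.linalg.norm(H @ V) ** 2
    return total


def inmf(Xs, k, lam=5.0, n_iter=200, seed=0, eps=1e-10):
    """Toy integrative NMF via multiplicative updates (illustrative assumption).

    Xs : list of nonnegative arrays, each (cells_i x genes) with the same genes.
    Returns per-dataset cell loadings Hs, shared factors W,
    dataset-specific factors Vs, and the objective value after each iteration.
    """
    rng = np.random.default_rng(seed)
    g = Xs[0].shape[1]
    W = rng.random((k, g))                          # shared gene factors
    Vs = [rng.random((k, g)) for _ in Xs]           # dataset-specific gene factors
    Hs = [rng.random((X.shape[0], k)) for X in Xs]  # cell factor loadings
    history = []
    for _ in range(n_iter):
        # Update cell loadings for each dataset.
        for i, X in enumerate(Xs):
            B = W + Vs[i]
            denom = Hs[i] @ (B @ B.T + lam * (Vs[i] @ Vs[i].T)) + eps
            Hs[i] *= (X @ B.T) / denom
        # Update the shared factors using all datasets.
        num = sum(H.T @ X for H, X in zip(Hs, Xs))
        den = sum(H.T @ H @ (W + V) for H, V in zip(Hs, Vs))
        W *= num / (den + eps)
        # Update the dataset-specific factors.
        for i, X in enumerate(Xs):
            HtH = Hs[i].T @ Hs[i]
            denom = HtH @ (W + Vs[i]) + lam * (HtH @ Vs[i]) + eps
            Vs[i] *= (Hs[i].T @ X) / denom
        history.append(inmf_objective(Xs, Hs, W, Vs, lam))
    return Hs, W, Vs, history


# Two synthetic "datasets" measured over the same 20 genes (random placeholders).
rng = np.random.default_rng(1)
X1 = rng.random((30, 20))
X2 = rng.random((40, 20))
Hs, W, Vs, history = inmf([X1, X2], k=4, n_iter=100)
```

In this toy setup, a larger `lam` shrinks the dataset-specific terms and pushes more of the signal into the shared factors W; the rows of each H_i could then be compared or clustered across datasets.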

Public Health Relevance

Identifying the "parts list" of the human body (that is, the types of cells present and their molecular properties) is a necessary prerequisite for understanding how cells go wrong in disease. This project develops algorithms for combining multiple kinds of molecular properties into a single unified catalog of cell types.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG010883-01
Application #
9861320
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Pazin, Michael J
Project Start
2019-09-03
Project End
2024-06-30
Budget Start
2019-09-03
Budget End
2020-06-30
Support Year
1
Fiscal Year
2019
Total Cost
Indirect Cost
Name
University of Michigan Ann Arbor
Department
Biostatistics & Other Math Sci
Type
Schools of Medicine
DUNS #
073133571
City
Ann Arbor
State
MI
Country
United States
Zip Code
48109