Quantitative Definition of Cell Identity by Integrating Transcriptomic, Epigenomic, and Spatial Features of Individual Cells

Abstract

Defining the molecular features that identify the myriad specialized subsets of cells within the human body is foundational to a genomic approach to medicine. High-throughput single-cell sequencing has recently opened the door to comprehensively characterizing the molecular identities of human cells. Multiple types of features contribute to cell identity, including gene expression, epigenomic modifications, and spatial location within a tissue, but it is not currently possible to measure all of these modalities simultaneously within the same single cells. Each experimental context and measurement modality provides a different glimpse into cellular identity, and how to combine these views into a unified picture of cell identity remains unclear. Computational integration of multiple single-cell experiments performed on different individual cells provides a way forward despite these challenges. However, existing approaches are neither robust enough to integrate single-cell data across the full range of biological contexts nor flexible enough to leverage the unique properties of different single-cell modalities, and they require recalculating results each time new data points arrive. We recently developed LIGER, a highly robust and flexible algorithm that can integrate single-cell data sharing a common set of gene-centric features across a wide range of biological contexts and modalities. A key property of our approach is the ability to identify both shared and dataset-specific features that define cell types across biological contexts. Additionally, LIGER is built upon a powerful matrix factorization framework that is readily extensible.
In preliminary analyses, we showed that our approach can identify cell-type-specific sexually dimorphic gene expression and human subject variation, map cell types across species, and jointly define cell types from multiple single-cell modalities that share corresponding features. Here, we build upon LIGER in several ways to develop a comprehensive framework that can most effectively leverage the unique aspects of transcriptomic, epigenomic, and spatial data for quantitative definition of cell identity. First, we develop an "online learning" algorithm that readily scales to millions of cells and can continually incorporate new data, allowing iterative refinement of cell identity (Aim 1). Second, we develop novel approaches to integrate single-cell modalities that assay different types of features (such as genes and intergenic peaks) and contain missing data (as in spatial transcriptomic datasets), enabling inference of epigenomic regulation and cross-modal data imputation (Aim 2). Third, in collaboration with biomedical scientists, we apply our approach to newly generated single-cell transcriptomic and single-cell epigenomic data from mouse skeletal stem cells and experimentally validate the predicted linkage between these data modalities (Aim 3). Our work addresses a crucial gap in analysis methods for single-cell genomic data and paves the way for a quantitative definition of cell identity that incorporates multiple types of cellular features.
Identifying the "parts list" of the human body (that is, the types of cells present and their molecular properties) is a necessary prerequisite for understanding how cells go wrong in disease. This project develops algorithms for combining multiple kinds of molecular properties into a single unified catalog of cell types.