Single-cell genomic technologies such as single-cell RNA-seq have emerged as powerful techniques to quantify molecular states of individual cells and can be used to elucidate the cellular building blocks of complex tissues and diseases. Given recent rapid advances in single-cell technologies, novel statistical and computational approaches are needed to efficiently analyze large-scale single-cell datasets with multiple data types such as gene and protein expression. Discrete Bayesian hierarchical models have been widely used for unsupervised modeling of discrete data types in fields such as Natural Language Processing (NLP). We have developed a Bayesian hierarchical model called Cellular Latent Dirichlet Allocation (Celda) to perform bi-clustering of genes into modules and cells into subpopulations. We will develop novel models that can perform clustering of cells into subpopulations using multi-modal genomic data or clustering of patients into subgroups using both single-cell data and patient-level characteristics. These novel methods will be made available in a scalable and interpretable cloud-based framework accessible to both computational and non-computational users.
The aims of this study are to (1) develop novel models to perform integrative multi-modal and multi-level clustering with single-cell data, (2) develop an R package and cloud-based platform with a web interface for rapid inference and visualization of large-scale datasets, and (3) apply Celda models to single-cell datasets from a variety of biological settings including cancer, lung development, and immunology. Overall, these aims will be accomplished by an interdisciplinary team with strong expertise in computational biology and bioinformatics, biostatistics, computer science, and molecular and cellular biology.
Single-cell genomic technologies have emerged as powerful techniques to quantify molecular states of individual cells and can be used to elucidate the cellular building blocks of complex tissues and diseases. We will develop novel discrete Bayesian hierarchical models that can cluster cells into subpopulations using multiple data types and cluster patients into subgroups using both single-cell data and patient-level characteristics. These novel methods will be made available in a scalable cloud-based framework accessible to both computational and non-computational users.