Single cell RNA-seq (scRNA-seq) profiling provides an unprecedented opportunity to conduct detailed analysis of cell subpopulations. Fulfilling the promise of scRNA-seq for biomedical studies and biomarker discovery requires robust computational approaches that support detection of rare phenotypes and unanticipated cellular responses. Current approaches for imputation, calibration, clustering and visualization of scRNA-seq data face challenges such as erroneous imputation of non-expressed genes, the limitations of linear assumptions in removing multivariate batch effects, and the inefficiency of clustering and dimensionality reduction methods on very large datasets. We have developed spectral, neural network, and Fast Multipole Method (FMM) prototypes suitable for addressing these issues in scRNA-seq and other high-throughput data, and we propose to further develop and adapt these methods to scRNA-seq data analysis. Our team of experts in data analytics and computational biology is currently funded through the NIH BD2K initiative to develop novel big data tools and methods with broad applicability to biomedical science. This effort demonstrated the feasibility of extremely efficient, scalable prototypes of neural network, spectral, and harmonic analysis techniques suitable for calibrating high dimensional data, reducing its dimensionality and visualizing it, finding intrinsic state-probability densities, and co-organizing cells, markers and samples. We propose substantial advances over existing analytical procedures used in single cell RNA-seq studies, including matrix recovery approaches for sparse and noisy scRNA-seq data that combine matrix completion and statistical techniques (Aim 1A), and calibration based on our unsupervised MMD-ResNet neural network prototype and optimal transport theory (Aim 1B).
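To make the calibration criterion in Aim 1B concrete, the following is a minimal sketch of a distribution distance between two batches, assuming that "MMD" denotes maximum mean discrepancy with a Gaussian kernel. The function names, the kernel bandwidth `sigma`, and the synthetic batches are illustrative assumptions and are not taken from the MMD-ResNet prototype, which trains a residual network against such a distance rather than merely evaluating it.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    # Squared Euclidean distance between every row of a and every row of b.
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def mmd_squared(x, y, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (
        rbf_kernel(x, x, sigma).mean()
        + rbf_kernel(y, y, sigma).mean()
        - 2.0 * rbf_kernel(x, y, sigma).mean()
    )


# Synthetic stand-ins for two batches of cells (rows) by features (columns).
rng = np.random.default_rng(0)
batch_a = rng.normal(0.0, 1.0, size=(200, 5))
batch_b = rng.normal(0.0, 1.0, size=(200, 5))
batch_shifted = batch_b + 1.5  # hypothetical additive batch effect

mmd_same = mmd_squared(batch_a, batch_b)
mmd_shift = mmd_squared(batch_a, batch_shifted)
print(f"no batch effect: {mmd_same:.4f}, shifted batch: {mmd_shift:.4f}")
```

In this sketch the shifted batch yields a larger value than the unshifted one, which is the quantity a calibration map would be trained to reduce.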
We will develop a variant of the FMM approach to speed up the calculation of the repulsion term of the t-distributed stochastic neighbor embedding (t-SNE) visualization technique, which will improve on our current fastest t-SNE prototype, the FFT-based FIt-SNE, and develop new, reliable approximate nearest-neighbor approaches to speed up the computation of the attraction term of t-SNE and of other clustering algorithms (Aim 2A). Our additional variants of t-SNE will be further developed to allow better separation between clusters of cell subpopulations (late exaggeration) and better visualization using 1D t-SNE for heatmap gene-cell representation (Aim 2A). We will adapt SpectralNet, our efficient neural network approach, for computing graph Laplacian eigenvectors for large datasets. This will enable spectral clustering, diffusion maps and manifold learning, which are used in many scRNA-seq studies but are currently limited to a moderate number of single cells (Aim 2B). Finally, we will develop a kernel-based differential abundance algorithm to characterize differences between biological conditions (Aim 2C). We will adopt appropriate sampling approaches to significantly improve current methods.
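For orientation on Aim 2B, the following is a minimal sketch of the classical dense computation of graph Laplacian eigenvectors that underlies spectral clustering. It is not SpectralNet: it builds a full Gaussian affinity matrix and calls a dense eigensolver, which is the step that restricts such methods to a moderate number of cells. The function name, the bandwidth `sigma`, and the two synthetic clusters are illustrative assumptions.

```python
import numpy as np


def spectral_embedding(points, sigma=1.0, n_components=2):
    """Embed points using eigenvectors of the symmetric normalized graph Laplacian."""
    # Dense pairwise squared distances: O(n^2) memory, the scalability bottleneck.
    sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    affinity = np.exp(-sq_dist / (2.0 * sigma**2))
    np.fill_diagonal(affinity, 0.0)  # no self-loops

    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # L = I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(points)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; skip the first (trivial) eigenpair.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvalues[1 : n_components + 1], eigenvectors[:, 1 : n_components + 1]


# Two synthetic, well-separated groups standing in for cell subpopulations.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.5, size=(30, 2))
cluster_b = rng.normal(4.0, 0.5, size=(30, 2))
cells = np.vstack([cluster_a, cluster_b])

eigenvalues, embedding = spectral_embedding(cells, sigma=1.0, n_components=2)
# The sign of the first non-trivial eigenvector splits the two groups.
labels = embedding[:, 0] > 0
```

The dense affinity matrix and eigendecomposition scale quadratically and cubically with the number of cells, respectively, which is why an approach that learns the eigenvectors from mini-batches is attractive for large datasets.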
This research plan aims to develop scalable spectral and neural network computational tools suitable for analyzing very large single cell RNA sequencing datasets. Specifically, our team, led by a computational biologist, two prominent applied mathematicians and two prominent biologists, will develop and validate novel, scalable techniques for: (i) completing the missing values prevalent in these measurements, (ii) removing batch effects, and (iii) reducing the dimensionality of very large single cell RNA sequencing datasets and visualizing them, enabling detection of rare cell populations and of minute changes in cell populations between biological conditions.