Efficient Methods for Dimensionality Reduction of Single-Cell RNA-Sequencing Data

Single-cell RNA-sequencing is a revolutionary technology enabling discoveries in human physiology and disease. The datasets generated by single-cell RNA-sequencing experiments are so large that they cannot be analyzed or visualized with traditional statistical methods until they have been shrunk using a technique called "dimensionality reduction." Almost every analysis of single-cell RNA-sequencing data begins with principal component analysis (PCA) to accomplish dimensionality reduction. However, single-cell RNA-sequencing presents unique challenges that make PCA difficult. First, these datasets are so large that computing PCA requires specialized hardware and multiple hours. Fast algorithms that approximate PCA have been shown to dramatically speed up this process, but they have not proliferated in the single-cell RNA-sequencing community, in part because no parallelized algorithm has been written in the R computing language. Second, PCA requires the researcher to decide the final desired size of the dataset (the number of principal components to retain). Choosing too small a size discards valuable biological insights, while choosing too large a size increases the noise. However, there is no consensus on how to pick the optimal size for single-cell RNA-sequencing data, and there is evidence that this size might be systematically underestimated. Lastly, PCA cannot be applied directly to the count data measured in single-cell RNA-sequencing, so researchers must first apply a preprocessing technique to normalize it. The current standard in the field is to apply the log transform; however, several recent studies have shown that the log transform creates statistical biases in single-cell RNA-sequencing data.
In this fellowship, fast methods for performing PCA, tailored specifically to single-cell RNA-sequencing data, will be developed:
1a) A framework to rigorously measure how changing preprocessing parameters affects the final results on several publicly available single-cell RNA-sequencing datasets, enabling experimentation with PCA on single-cell RNA-sequencing data.
1b) An ultra-fast, parallelized implementation of randomized PCA, allowing researchers using standard laptops to rapidly perform PCA on single-cell RNA-sequencing data.
2) A technique for rigorously choosing the final size when performing PCA on single-cell RNA-sequencing datasets.
3) A method for transforming single-cell RNA-sequencing data so that it becomes appropriately distributed, enabling proper use of PCA without incurring statistical biases.
This fellowship also includes a detailed training plan with valuable learning experiences for the applicant's development as a physician-scientist who can apply methods from high-dimensional statistics to solving biomedical problems.
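
As a rough illustration of the kind of randomized PCA named in aim 1b, the sketch below approximates the top principal components of a cells-by-genes matrix using a random projection followed by a small exact SVD. It is a minimal, single-threaded sketch and not the applicant's implementation: it is written in Python with NumPy rather than R, the function name `randomized_pca` and its parameters (`n_oversamples`, `n_iter`, `seed`) are hypothetical, and the input is simulated Poisson counts passed through the log transform purely as a stand-in for real preprocessed data.

```python
import numpy as np


def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top principal components of X (rows = cells, columns = genes).

    Hypothetical helper for illustration: a basic randomized range finder
    with power iterations, followed by an exact SVD of a small matrix.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                  # center each gene (column)
    k = n_components + n_oversamples         # oversample for accuracy

    # Sample the column space of Xc with a random Gaussian test matrix.
    Q = Xc @ rng.standard_normal((Xc.shape[1], k))

    # Power iterations sharpen the separation between leading and
    # trailing singular values; QR keeps the basis numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Q)
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q = Xc @ Q
    Q, _ = np.linalg.qr(Q)

    # Project onto the small basis and take an exact SVD there.
    B = Q.T @ Xc                             # shape (k, n_genes)
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub

    scores = U[:, :n_components] * s[:n_components]   # cell embeddings
    return scores, Vt[:n_components], s[:n_components]


# Simulated stand-in data: 500 cells x 200 genes of Poisson counts,
# log-transformed (an assumption for illustration only).
rng = np.random.default_rng(42)
counts = rng.poisson(lam=2.0, size=(500, 200))
log_counts = np.log1p(counts)

scores, components, singular_values = randomized_pca(log_counts, n_components=10)
print(scores.shape, components.shape)
```

The expensive exact SVD is applied only to a matrix with `n_components + n_oversamples` rows, which is where the speedup over a full decomposition comes from. The number of components is still supplied by the caller here, so this sketch does not address the choice of final size described in aim 2.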

Public Health Relevance

Single-cell RNA-sequencing is a revolutionary technology enabling discoveries in human physiology and disease. The datasets collected by this technology are so large that analyzing them requires specialized statistical techniques and cumbersome computing hardware; however, these techniques were designed for a different type of data and consequently create systematic biases. This fellowship seeks to design extremely fast data analysis tools tailored specifically to single-cell RNA-sequencing.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Individual Predoctoral NRSA for M.D./Ph.D. Fellowships (ADAMHA) (F30)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Gatlin, Tina L
Yale University
Public Health & Prev Medicine
Schools of Medicine
New Haven
United States