Single-cell RNA sequencing (scRNA-Seq) is starting a revolution in cancer genomics research. The technique measures gene expression in each individual cell, for hundreds to thousands of cells in a single experiment. Tumors are inherently heterogeneous: different types of cells cooperate with each other so that this deadly disease can invade, metastasize, and develop therapy resistance. ScRNA-Seq gives researchers unprecedented ability to identify and describe these cell types, and thus to uncover the mechanisms behind cancer. Ultimately, this insight will lead to better cancer therapies. The tools of single-cell sequencing also have direct translational applications in the clinic, in areas such as early detection, noninvasive monitoring, and guiding targeted therapy. However, the development of data analysis methods is seriously lagging, which may have led to dubious biological findings and can seriously obstruct future discoveries. ScRNA-Seq data show distinct features that can cause real problems. First and foremost, many genes (even moderately or highly expressed genes) have expression measurements of zero, and many of these zeros are experimental artifacts. Second, gene expression measurements can be heavily biased by the cell-cycle effect. Both features can cause serious difficulties in identifying and describing cell types and in many other applications of scRNA-Seq data. The goal of this proposal is to develop stand-alone methods that handle these features and deliver clearer and less biased data, which will then boost the power of subsequent statistical analyses and facilitate exciting biological discoveries. We propose Aim 1 to detect zeros that are experimental artifacts and infer their "true" values, and Aim 2 to remove the cell-cycle effect. Moreover, in Aim 3 we will develop user-friendly software that implements our algorithms and make it publicly available.
Our algorithms and software will output much clearer and less biased data. To answer any specific biological question of interest, cancer researchers can then either use existing algorithms developed for bulk RNA-Seq or microarray data, or develop new algorithms without special handling of the troublesome features of raw scRNA-Seq data. This greatly reduces the burden of data analysis and will accelerate biological and medical discoveries. We expect our software to serve as an essential pre-processing step for any application of scRNA-Seq data.

Public Health Relevance

Single-cell RNA sequencing gives researchers unprecedented ability to identify and describe the different cell types in tumor tissues and their dynamics, and thus to uncover the mechanisms behind cancer and ultimately develop better cancer therapies. In this proposal, we propose new strategies and statistical methods to clean up the very noisy and highly biased data generated by this pioneering technique. The cleaned data we deliver will greatly simplify follow-up data analysis and accelerate biological and medical discoveries.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Small Research Grants (R03)
Study Section
Special Emphasis Panel (ZCA1)
Program Officer
Li, Jerry
University of Notre Dame
Biostatistics & Other Math Sci
Schools of Arts and Sciences
Notre Dame
United States