Genomics is a fast-evolving field that has driven advances in molecular-level data-acquisition technologies and in analytics algorithms for high-dimensional data, and has yielded discoveries of biological insights and mechanisms. This project aims to develop data-analytics algorithms to address computational challenges in genomics. Genomics analyses often generate complex datasets that are large and high-dimensional. In addition, the datasets often contain considerable amounts of missing values, which makes computational analysis and biological interpretation challenging. Almost all existing computational algorithms treat missing values as a problem to be fixed. In contrast, this project takes a novel view of missing values and treats them as useful signals. This is a unique perspective, opposite to the common assumptions about missing values in the literature. Developing and establishing this perspective has the potential to reveal new biological insights, inspire new directions for algorithmic development, and generate educational and training opportunities in both course materials and research activities.
The goal of this project is to examine the utility of missing values in two types of genomic analyses. Aim 1 of this project focuses on single-cell RNA-sequencing analysis. Since signals at the single-cell level are typically low and stochastic, single-cell RNA sequencing often produces gene-expression datasets that are highly sparse, with zeros typically making up more than 90% of the entries. This aim will develop and validate algorithms to identify and cluster cell types based on the missing values in single-cell RNA-sequencing data. Aim 2 of this project focuses on bulk-tissue RNA-sequencing analysis. Although bulk-tissue RNA sequencing provides stronger signal than single-cell RNA sequencing, missing values also exist in bulk-tissue data, mainly due to the tissue specificity of gene expression. This aim will examine the patterns of missing values across large-scale gene-expression datasets spanning multiple tissue types and diseases, and explore the utility of missing values in correlating with clinical variables of interest.
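As a rough illustration of the idea behind Aim 1, the sketch below groups cells using only which genes are detected (nonzero) and which are missing (zero), ignoring expression magnitudes. It is not the project's algorithm: the Jaccard distance, the single-linkage grouping, the `max_distance` threshold, the function names, and the toy count matrix are all assumptions chosen for illustration.

```python
from itertools import combinations


def detection_pattern(counts):
    """Return, for each cell, the set of gene indices with a nonzero count."""
    return [frozenset(g for g, c in enumerate(cell) if c > 0) for cell in counts]


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_by_zero_pattern(counts, max_distance=0.5):
    """Single-linkage grouping: link cells whose patterns are within max_distance."""
    patterns = detection_pattern(counts)
    parent = list(range(len(patterns)))  # union-find forest over cells

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(patterns)), 2):
        if jaccard_distance(patterns[i], patterns[j]) <= max_distance:
            parent[find(j)] = find(i)

    # Relabel roots as consecutive cluster ids in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(patterns))]


# Toy matrix (hypothetical counts): rows are cells, columns are genes.
toy_counts = [
    [5, 0, 3, 0, 0, 0],
    [2, 0, 7, 0, 0, 0],
    [0, 4, 0, 0, 9, 0],
    [0, 1, 0, 0, 6, 2],
]
labels = cluster_by_zero_pattern(toy_counts)
print(labels)
```

In this toy case the first two cells share one detection pattern and the last two share another, so the zero pattern alone separates the two groups. Real single-cell data would call for a sparse-matrix representation and a more scalable clustering method.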
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.