Modern data acquisition technology produces new types of data that carry rich information but also pose new challenges for analysis. In many modern datasets, the basic unit of measurement can be a matrix or an even higher-order array recording the interactions among one or more groups of individuals. For example, a gene co-expression network measures the average strength of correlation between each pair of genes in a particular organ tissue. With gene co-expression networks collected at different developmental stages, it is possible to understand how groups of genes change their behavior in a coherent way. As another example, next-generation sequencing techniques can produce gene expression data at different scales: tissue sample data consist of gene expression in bulk tissue samples, whereas single-cell RNA sequencing data contain the expression of the same genes in individual cells. Motivated by these examples, this research aims to develop novel probability tools and statistical inference methods for complex matrix-valued datasets, which will enable scientists to uncover salient structures in such datasets in a coherent and efficient way. The project also provides research training opportunities for graduate students.
This project consists of two parts. In the first part, the PI studies multilayer networks with a latent structure shared across layers and develops methods that efficiently combine information from the different layers to recover that structure, which would be impossible if only a single layer were available. The expected results will provide new probability theorems describing the behavior of random noise in matrix form, as well as its linear combinations and higher-order functions. In the second part, the PI studies a series of inference problems related to tissue and single-cell RNA-seq data, starting from computationally efficient dimensionality reduction and variable selection, followed by downstream inference problems such as cell type deconvolution in tissue RNA-seq data. The expected results will provide an important addition to the sparse principal component analysis literature by developing a projection-free, gradient-based algorithm with provable global convergence properties. The cell type deconvolution problem will be an interesting application combining techniques from variable selection, nonnegative matrix factorization, and optimization.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.