The undirected graphical model (GM), a powerful tool for investigating the relationships among a large number of random variables in a complex system, is used in a wide range of scientific applications, including image analysis, statistical physics, astrophysics, finance, and biomedical studies. With recent technological advances, unprecedented amounts of information can be collected for a given system, which makes it more challenging to establish meaningful inferential guarantees for GMs. Despite recent successes in the development of methods and theory for Gaussian GMs, the underlying assumption of continuous, normally distributed data is violated for some important data types. For example, ordinal, binary, and count data are all discrete in nature and cannot be naively transformed into Gaussian data. In biomedical studies, examples of non-Gaussian data include DNA copy number variation, mutation, and (single-cell) RNA-sequencing data. Compared with recent advances in Gaussian GMs, research on modeling and theoretical foundations for non-Gaussian data types has fallen behind. To bridge this gap, the PI will identify some of the major modeling and inferential challenges and propose several new graphical models for non-Gaussian data. In addition, the PI will develop, evaluate, and improve new statistical and computational inference methods for these models with theoretical guarantees.
The proposed research will significantly advance the fundamental theoretical understanding of modeling and statistical inference for non-Gaussian data in graphical models through three tasks. (I) Development of a new two-step inference procedure for the covariate-adjusted truncated Poisson graphical model (TPGM), which provides a unified framework for modeling both binary and count data. The inferential procedure fully respects the intrinsic sparse structure of the graph, making it more reliable. A novel likelihood-based non-linear score vector for bias correction will be developed. (II) Development of a novel zero-inflated TPGM that fully accounts for the zero-inflation pattern in the data, in order to model single-cell RNA-sequencing data at the cell level. The inferential procedure, based on EM algorithms, paves the way to a better understanding of the genetic networks in different cell types, and thus of the mechanisms of various diseases. A composite-likelihood-based EM algorithm is used to overcome computational difficulties. (III) Development of a novel latent semiparametric graphical model to draw inferences on the intrinsic graph structure by integrating both ordinal and continuous data. The method takes potential confounding effects into account in order to draw meaningful conclusions. Beyond fundamental advances in the statistical modeling and theory of graphical models, the research will have immediate impact on applications in a number of scientific disciplines, including biology, pharmacy, finance, and genomics. The results will be disseminated through publications, open-source software, and presentations at conferences.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.