In the era of Big Data, technological advances have brought significant changes in the amount and complexity of data generated in almost every discipline, from astronomy to genomics to medicine. Finding meaningful patterns and extracting relevant information from large-scale, high-dimensional data in a reliable and efficient fashion has become an essential component of the intellectual endeavor. Understanding and capturing the regular structures underlying the data is crucial for subsequent modeling and prediction. Low-dimensional projections of data are often primary tools for uncovering this structure and coping with high dimensionality, along with other techniques that exploit sparsity or structural simplicity. Methods for dimension reduction therefore significantly aid the process of gathering information from data. This project concerns nonlinear dimension reduction methods that can be viewed as extensions of standard principal component analysis (PCA), a widely used tool for low-rank approximation of data. The research aims to expand the scope of PCA to various types of data, from binary and ordinal responses to counts, and to unravel the data embeddings given by nonlinear extensions of PCA. Enhanced understanding of existing tools and the development of new tools in this research will improve statistical practice in many ways.
This project focuses primarily on two nonlinear extensions of PCA, kernel PCA and generalized PCA, for various data types including exponential family data. The research has two specific aims: (i) to understand the geometry of the nonlinear data embeddings given by kernel PCA through spectral analysis of the kernel operator, and to examine how the choice of kernel and the centering of the kernel affect the resulting nonlinear principal components for clustering, in relation to the data distribution; and (ii) to develop statistically principled extensions of the PCA methodology for the analysis and modeling of data matrices from exponential family distributions using the generalized linear model framework. Methodologically, the research parallels the coherent extension of the linear model to the generalized linear model framework, applied here to the best low-rank approximation of data. Computational tools will be developed for a wide range of applications of the studied methods.
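As a rough illustration of the first aim, the following minimal sketch computes kernel principal components from the eigendecomposition of a kernel matrix, with optional centering. It is not the project's software: the Gaussian kernel, the bandwidth value gamma = 0.5, the two-cluster toy data, and the function names rbf_kernel and kernel_pca are all assumptions chosen only for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Gaussian kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_pca(K, n_components, center=True):
    # Eigendecomposition of the (optionally centered) kernel matrix.
    n = K.shape[0]
    if center:
        H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
        K = H @ K @ H
    vals, vecs = np.linalg.eigh(K)           # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    # Scores of the training points on the leading kernel components.
    scores = vecs * np.sqrt(np.maximum(vals, 0.0))
    return vals, scores

# Hypothetical toy data: two well-separated Gaussian clusters in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (30, 2)),
               rng.normal(2.0, 0.5, (30, 2))])

K = rbf_kernel(X, gamma=0.5)  # gamma is an assumed bandwidth
vals, scores = kernel_pca(K, n_components=2, center=True)
```

Calling kernel_pca with center=False on the same kernel matrix gives a simple way to compare the embeddings with and without centering, which is the kind of effect the first aim proposes to study.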