The investigators have two aims in this proposal, both at the interface of numerical algebra and statistical inference. The first aim is to extend the use of randomized approximation in a variety of dimension reduction methods that rely on numerical linear algebra, including supervised and unsupervised as well as linear and nonlinear methods, and to develop a statistical basis for these methods that complements their computational motivation of being applicable to massive data. The second aim is to extend these statistical methods for dimension reduction to multiway data using numerical multilinear algebra, a recent development in numerical analysis. These projects will increase interaction between statistical inference and numerical analysis and benefit both fields, providing new perspectives on how we view and perform data analysis.
Numerical methods with statistical implications are central to a variety of technologies used by the general population. These technologies include Google's PageRank algorithm, genetic methods used to find genetic variation related to disease, compression of medical images for storage and treatment, and applications in geostatistics. In all of these cases the fundamental idea is to condense massive data into a summary that is useful with respect to a desired goal. The two aims of this proposal are (1) to study how numerical methods that scale to the massive data generated in modern scientific, engineering, and social applications impose statistical assumptions or models on the data, and (2) to study more complex interactions or properties of the data than are examined by current methods. The motivation behind the first aim is to understand how the numerical approximations required for computational scaling, as we collect more data, affect the information that can be extracted from these data; in particular, for what types of data and applications certain numerical approximations work well. The motivation behind the second aim is to go beyond the broad category of standard statistical methods that take into account only the relation between pairs of objects, such as two web pages that are linked in Google's PageRank, or the correlation between two genes or two loci in genetics applications. The question behind this aim is whether richer sources of information can be extracted by examining the links among three web pages or three loci. The research involved in this aim consists of developing computationally efficient algebraic methods to extract this information and understanding the statistical models implemented by these methods.