The investigators have two aims in this proposal, both at the interface of numerical algebra and statistical inference. The first aim is to extend the use of randomized approximation in a variety of dimension reduction methods that rely on numerical linear algebra (supervised and unsupervised, linear and nonlinear) and to develop a statistical basis for these methods, complementing their computational motivation of being applicable to massive data. The second aim is to extend these statistical methods for dimension reduction to multiway data using numerical multilinear algebra, a recent development in numerical analysis. These projects will increase interaction between statistical inference and numerical analysis and benefit both fields, providing new perspectives on how we view and perform data analysis.

Numerical methods with statistical implications are central to a variety of technologies used by the general population. These technologies include Google's PageRank algorithm, genetic methods used to find genetic variation related to disease, compression of medical images for storage and treatment, and applications in geostatistics. In all of these cases the fundamental idea is to condense massive data into a summary that is useful with respect to a desired goal. The two ideas in this proposal are (1) to study how numerical methods that scale to the massive data generated in modern scientific, engineering, and social applications impose statistical assumptions or models on the data, and (2) to study more complex interactions or properties of the data than current methods examine. The motivation behind the first aim is to understand how the numerical approximations required for computational scaling as we collect more data affect the information that can be extracted from these data: for what types of data and applications do certain numerical approximations work well? The motivation behind the second aim is to go beyond the broad category of standard statistical methods that take into account the relation between pairs of objects, such as two web pages that are linked for Google's PageRank, or the correlation between two genes or two loci in genetics applications. The question behind this aim is whether richer information can be extracted by examining the links among three web pages or three loci. The research involved in this aim consists of developing computationally efficient algebraic methods to extract this information and understanding the statistical models implemented by these methods.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1209155
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2012-07-01
Budget End: 2015-06-30
Support Year:
Fiscal Year: 2012
Total Cost: $150,001
Indirect Cost:
Name: Duke University
Department:
Type:
DUNS #:
City: Durham
State: NC
Country: United States
Zip Code: 27705