In a discrete multi-way layout, each of the factors affecting the measured response takes on a finite number of levels, which may be either nominal (pure labels) or ordinal (real values whose order and magnitude bear information). Often, relatively few noisy observations are made, and only at a subset of the theoretically possible factor-level combinations. Basic problems in analyzing such sparse regression-type data are to extract signal from noise efficiently at the observed factor-level combinations; to assess the uncertainty in the extracted signal intelligibly; and to extrapolate the extracted signal plausibly to the unobserved factor-level combinations. This project will develop the following data-driven regularization methodology: a) use a tractable general probability model for the incomplete multi-way layout to motivate classes of candidate Bayes estimators for the means of the multi-way layout; b) estimate the risk of each candidate estimator under this general model; c) define the regularized estimator to be the candidate estimator with smallest estimated risk; d) prove theoretically that the risk of the regularized estimator converges asymptotically to that of the best candidate estimator; e) develop confidence sets centered at the regularized estimator that quantify the uncertainty of that estimator; f) experiment with the regularized estimators on case-study data and on pertinent artificial data.
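Steps b) and c) can be illustrated with a deliberately simplified sketch. It is not the project's methodology: it assumes a complete two-way layout with one observation per cell, a known noise variance `sigma2`, and a one-parameter family of candidates that shrink each cell toward the grand mean. The risk estimate used is the standard Mallows-type unbiased estimate for a linear estimator, chosen here only for illustration. All function names are hypothetical.

```python
import numpy as np


def estimated_risk(y, fit, trace, sigma2):
    """Mallows-type unbiased estimate of the per-cell quadratic risk of a
    linear estimator fit = A y, where trace = tr(A) and sigma2 is the
    (assumed known) noise variance."""
    n = y.size
    return np.sum((y - fit) ** 2) / n + 2.0 * sigma2 * trace / n - sigma2


def select_estimator(y, sigma2, weights):
    """Return (estimated risk, weight, fitted means) for the candidate
    with the smallest estimated risk.

    Candidate with weight w: grand_mean + w * (y - grand_mean).
    w = 0 fits the grand mean everywhere; w = 1 reproduces the raw data.
    """
    n = y.size
    grand = y.mean()
    best = None
    for w in weights:
        fit = grand + w * (y - grand)
        # The candidate is A y with A = w*I + (1 - w)*J/n (J = all-ones),
        # so tr(A) = w*n + (1 - w) = 1 + w*(n - 1).
        trace = 1.0 + w * (n - 1)
        risk = estimated_risk(y, fit, trace, sigma2)
        if best is None or risk < best[0]:
            best = (risk, float(w), fit)
    return best


if __name__ == "__main__":
    # Artificial two-way layout: additive row and column effects plus noise.
    rng = np.random.default_rng(0)
    rows = np.linspace(-1.0, 1.0, 6)[:, None]
    cols = np.linspace(-0.5, 0.5, 8)[None, :]
    truth = rows + cols
    sigma2 = 0.25
    y = truth + rng.normal(scale=np.sqrt(sigma2), size=truth.shape)

    risk, w, fit = select_estimator(y, sigma2, np.linspace(0.0, 1.0, 21))
    print("selected weight:", w)
    print("estimated risk:", risk)
    print("actual loss:", np.mean((fit - truth) ** 2))
```

A richer candidate class, an estimated rather than known variance, and missing cells would all change the details; the sketch only shows the pattern of scoring each candidate by estimated risk and keeping the minimizer.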
Important practical instances of the multi-way layout data treated in the project are spatial data, the gene or protein chip data of bioinformatics, and the digital images and videos of medical imaging or of industrial quality control. The project will develop effective, semiautomatic algorithms for separating pattern from unimportant details or noise in such data. It will thereby focus human intervention on the subtle questions that remain after effective algorithmic analyses of large multi-way layouts. The project will contribute to solving several core research challenges identified in Section 4 of the 2003 NSF Report "Statistics: Challenges and Opportunities for the Twenty-First Century". Among these technical challenges are Bayes and biased estimation, data reduction and compression, and structuring the interaction between statistical theory and computational experiments. Project results will be submitted for journal publication and will be posted on the PI's website. A unified account of the methodology is planned for a monograph. Portions of the project, including case-study applications of the methodology to the data types listed above, will guide Ph.D. thesis research by the PI's students.