This project has two major components. The first is an investigation of visualization and analysis methods for high-dimensional data sets, with a focus on categorical variables whose number of unique levels is comparable to the total sample size. Examples of such variables include search query strings, ISBNs, song titles, author names, URLs, genotypes, environments, and customer ID numbers. The visualization methods are designed to show broad trends and to highlight anomalies. The inferential methods are of the sample reuse type: the bootstrap and cross-validation. New methods are necessary here because the data sets have complicated interlocking patterns that invalidate any IID sampling assumptions. The second component is improved statistical inference obtained through better numerical methods. This includes calibrating empirical likelihood methods to get better coverage and extending confidence regions for the mean beyond the convex hull of the data points. It also includes embedding quasi-Monte Carlo sampling methods into Markov chain Monte Carlo algorithms to combine the accuracy of the former with the wide applicability of the latter.
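To illustrate the last idea, the sketch below drives a simple random-walk Metropolis sampler with randomized van der Corput sequences in place of IID uniforms. It is a minimal, hypothetical illustration of placing quasi-Monte Carlo points inside an MCMC loop, not the construction developed in the project; the standard normal target, proposal scale, and sample size are arbitrary choices made for the example.

    import numpy as np

    def van_der_corput(n, base=2):
        """First n points of the base-b van der Corput sequence."""
        u = np.zeros(n)
        for i in range(n):
            k, f = i + 1, 1.0
            while k > 0:
                f /= base
                u[i] += f * (k % base)
                k //= base
        return u

    def metropolis(u_prop, u_acc, step=2.0):
        """Random-walk Metropolis for a N(0,1) target, driven by externally
        supplied uniforms in place of the usual IID draws."""
        x, chain = 0.0, np.empty(len(u_prop))
        for i, (up, ua) in enumerate(zip(u_prop, u_acc)):
            y = x + step * (2.0 * up - 1.0)         # symmetric uniform proposal
            if ua < np.exp(0.5 * (x * x - y * y)):  # N(0,1) acceptance ratio
                x = y
            chain[i] = x
        return chain

    rng = np.random.default_rng(0)
    n = 4096
    # A Cranley-Patterson rotation randomizes each sequence while keeping
    # its balanced structure.
    u1 = (van_der_corput(n, 2) + rng.random()) % 1.0
    u2 = (van_der_corput(n, 3) + rng.random()) % 1.0
    print("target variance estimate, QMC drive:", metropolis(u1, u2).var())
    print("target variance estimate, IID drive:", metropolis(rng.random(n), rng.random(n)).var())

Both runs estimate the variance of the N(0,1) target (true value 1); the only difference between them is the source of the driving uniforms.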
Exploratory data analysis of categorical variables is useful for seeing broad patterns, including small groups of customers who have similar tastes for a short list of songs, books, or movies. It is also useful for identifying anomalies that may indicate abusive behavior, including cyber-attacks and what is commonly called spam in the online context. One of the original motivations for the sample reuse methods comes from crop science. In some of those problems, a large number of plant varieties (genotypes) are grown under many different environmental conditions. A statistical model is used to determine which varieties to use in each environment. Earlier statistical methods were based on assumptions that do not fit this setting, and they often did not select the best model. New methods from this project may therefore be used to select better models, which then result in increased production of food and fiber. The empirical likelihood work is basic research aimed at removing unnecessary mathematical assumptions from statistical models in order to widen their applicability. The Monte Carlo sampling component of the project is basic research on a computational technique used extensively in physics as well as in Bayesian statistical inference.
For 2013/14, this project developed methods for choosing a representative set of points in a region. One paper developed a low discrepancy technique for sampling points inside a triangle. It was previously known that a good way to do this must exist; the paper showed explicitly how to construct one. There were several papers on measuring the relative importance of different input variables to a function that one can compute at any desired point. There were two papers on extensible quadrature rules. Prior to 2013/14, the project developed quasi-Monte Carlo sampling methods for Markov chain Monte Carlo, bootstrap and cross-validatory methods for high-dimensional samples, visualization methods for internet data, and confidence interval methods that do not require the user to specify a parametric family for the observations. The broader impacts of this work include the following. Sampling points inside a triangle is an important step in graphical rendering as used in the motion picture industry, scientific visualization, and architectural rendering. Measuring the importance of input variables to a function is useful in engineering problems. For instance, the performance of an aircraft, computer chip, or car part depends on many variables describing how it is made. Engineers need to focus on the most important of those variables. Extensible quadrature rules let one keep sampling until a desired accuracy level has been reached. The project provided training for six PhD students who worked with the PI in 2013/14. Prior to that, an additional six PhD students graduated.
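As one concrete illustration of the variable-importance theme, the sketch below estimates first-order Sobol' indices with the standard pick-and-freeze construction. This is generic background machinery, not the specific estimators analyzed in the project's papers; the test function, dimension, and sample size are illustrative choices made for the example.

    import numpy as np

    def first_order_sobol(f, d, n, rng):
        """Pick-and-freeze estimates of the first-order Sobol' indices of f on [0,1]^d."""
        A = rng.random((n, d))
        B = rng.random((n, d))
        yA = f(A)
        S = np.empty(d)
        for i in range(d):
            C = B.copy()
            C[:, i] = A[:, i]   # C shares only coordinate i with A
            # Cov(f(A), f(C)) estimates Var(E[f(X) | X_i]).
            S[i] = np.cov(yA, f(C))[0, 1] / np.var(yA, ddof=1)
        return S

    # Additive test function with known first-order indices 16/21, 4/21, 1/21.
    f = lambda x: 4 * x[:, 0] + 2 * x[:, 1] + x[:, 2]
    print(first_order_sobol(f, d=3, n=100_000, rng=np.random.default_rng(1)))

Each index is the fraction of the output variance explained by one input acting alone, which is one common way to rank the importance of the variables describing, say, how a part is made.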