The proposed research exploits an idea of John Tukey that was never published. Called scagnostics (a Tukey neologism for "scatterplot diagnostics"), the original idea leads to a more general characterization of high-dimensional point sets using visually-based geometric and graph-theoretic measures. These measures comprise a canonical set of 9 features of pointwise data typically observed by experienced statisticians. Computing these measures on all possible 2D axis-parallel orthogonal projections in a p-dimensional space results in a p(p-1)/2 × 9 matrix of measures. The objective of the proposed research is to generalize scagnostics to a new approach called Visual-Model-Based Transformations (VMBT). Visually-based transforms, together with multivariate analyses, can reveal visual patterns that are of interest to analysts. When interesting patterns are discovered in transform-space, one can invert the map and infer patterns in the raw data space.
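
The shape of the measure matrix described above can be illustrated with a minimal sketch. It only shows how the p(p-1)/2 variable pairs each yield one row of 9 values. The function names and the sample data are hypothetical, and `placeholder_measures` is a stand-in rather than the actual scagnostics measures: it puts the absolute Pearson correlation in the first slot and zeros in the other eight.

```python
from itertools import combinations
from math import sqrt

N_MEASURES = 9  # the abstract specifies nine measures per scatterplot


def placeholder_measures(x, y):
    """Hypothetical stand-in: returns nine numbers for one scatterplot.

    This is NOT the scagnostics computation. Slot 0 holds |Pearson r|;
    the remaining eight slots are left at 0.0.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0
    return [abs(r)] + [0.0] * (N_MEASURES - 1)


def measure_matrix(columns, measures=placeholder_measures):
    """Build one row of measures for every unordered pair of variables."""
    p = len(columns)
    return [measures(columns[i], columns[j])
            for i, j in combinations(range(p), 2)]


# Made-up data: p = 4 variables observed on 4 points.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 1.0, 3.0, 2.0],
    [0.5, 0.1, 0.9, 0.4],
]
matrix = measure_matrix(columns)
print(len(matrix), len(matrix[0]))  # rows = p(p-1)/2, columns = 9
```

With p = 4 variables there are 4 · 3 / 2 = 6 axis-parallel 2D projections, so the matrix has 6 rows. Any real set of measures could be passed through the `measures` argument, provided it returns nine values per scatterplot.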

Scagnostics exploits an important aspect of visualizations. A visualization can be thought of as a visual representation of an underlying mathematical model. Even simple charts of raw data rest on a model that helps (one hopes) to reveal some interesting aspect of the data. We often take these models for granted when we view familiar graphs. However, understanding mathematical models underlying visualizations can help us to devise more effective models for revealing structure in more complex datasets. Visual-Model-Based Transformations are a class of models that may prove especially effective for this purpose. Such models are motivated by visual structures perceived and processed by analysts. Given this visual motivation behind their design, visual models are likely to reveal features of data that are quite different from those appearing in common statistical and scientific graphics.

Project Report

The focus of this award was to identify models and develop software that can be used to characterize the shape and distribution of points embedded in geometric spaces. If a set of points is relatively dense (imagine a swarm of bees), then it can be seen to have a shape that is bounded by empty regions (imagine a blue-sky background). How can we identify this shape using only information about each point's location in space? The method we employed was based on scagnostics, a term coined by the statistician John Tukey. This term combines the words "scatterplot" and "diagnostics." Our innovation was to find an efficient method for computing scagnostics on many points. Thus, we were able to derive measures of clumpiness, stringiness, outlyingness, striation, and so on. The value of these measures lies in our ability to find important patterns in high-dimensional space. A high degree of clumpiness, for example, would indicate the presence of clusters in data and might invite further inquiry through statistical methods such as cluster analysis. The other innovation we introduced in this investigation was to use random projections to produce low-dimensional "pictures" of these point distributions (imagine the shadow cast by our bee swarm onto a piece of paper held in the sunlight). By characterizing the shapes of point cloud projections in low-dimensional space, we were able to conserve computing resources and enable the analysis of big datasets. We published papers primarily in the visualization community. Before we submitted our grant application, there was only one reference to the word "scagnostics" (found in one of Tukey's original talks). As of October 1, 2014, there were 3,470 references, all traceable to our research.
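
The "shadow" idea of a random projection can be sketched in a few lines. The report does not say how the projections were constructed, so everything specific here is an assumption made for illustration: the Gaussian entries, the 1/sqrt(k) scaling, the fixed seed, and the names `random_projection`, `cloud`, and `shadow`.

```python
import random
from math import sqrt


def random_projection(points, k=2, seed=0):
    """Project p-dimensional points onto k random directions.

    Assumption: the p x k projection matrix has i.i.d. Gaussian entries
    scaled by 1/sqrt(k), a common textbook choice and not necessarily
    the construction used in the funded work.
    """
    rng = random.Random(seed)
    p = len(points[0])
    scale = 1.0 / sqrt(k)
    basis = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
             for _ in range(p)]
    # Each output coordinate is the dot product of a point with one
    # column of the random matrix.
    return [[sum(row[d] * basis[d][c] for d in range(p)) for c in range(k)]
            for row in points]


# Made-up "swarm": 100 points in 10 dimensions.
data_rng = random.Random(1)
cloud = [[data_rng.random() for _ in range(10)] for _ in range(100)]
shadow = random_projection(cloud, k=2)
print(len(shadow), len(shadow[0]))  # one 2D point per original point
```

Each 2D shadow produced this way is an ordinary scatterplot, so shape measures can be computed on it at far lower cost than working in the full space. Drawing several shadows with different seeds gives several views of the same point cloud.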

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0808860
Program Officer: Tie Luo
Budget Start: 2008-07-01
Budget End: 2014-06-30
Fiscal Year: 2008
Total Cost: $660,381
Name: University of Illinois at Chicago
City: Chicago
State: IL
Country: United States
Zip Code: 60612