The objective of this research is to greatly expand the collection of statistical tools that exploit pairwise distance and dissimilarity information in statistical model building for regression, classification, and variable/pattern selection at different scales. This includes non-metric pairwise dissimilarity information. In this work, dissimilarity information may be subjective, noisy, incomplete, or confined to a nonlinear manifold; it may also come from multiple sources and be inconsistent. Previous results have shown how such information can be embedded into a Euclidean space, so that methods that operate in a Euclidean space can be used. Two recent and very powerful tools, distance correlation and distance components, provide principled testing of correlations between arbitrary groups of variables and testing of equality of distributions, based only on pairwise Euclidean distances and requiring essentially no distributional assumptions. Thus, embedding non-metric information into a Euclidean space and then applying distance correlation and distance components to the embedded data provides an important new approach to using "messy" pairwise data. Furthermore, distance correlation and distance components are being extended to certain regression, classification, and variable/pattern selection problems via parametrization, tuning, and testing techniques, preceded, when appropriate, by embedding techniques. A series of tasks implementing aspects of this program provides principled advances in the major statistical tasks of regression, classification, and variable/pattern selection for non-traditional information.
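
As a rough illustration of the two-step idea described above (embed dissimilarities into a Euclidean space, then work only with pairwise Euclidean distances), the following Python sketch computes the standard sample distance correlation and uses classical multidimensional scaling as a stand-in embedding. It is not the project's method: the abstract does not say which embedding technique is used, so classical MDS is an assumption, and the function names (`pairwise_euclidean`, `double_center`, `distance_correlation`, `classical_mds`) are hypothetical.

```python
import numpy as np


def pairwise_euclidean(x):
    """Return the n x n matrix of Euclidean distances between the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D input as n scalar observations
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def double_center(d):
    """Subtract row and column means and add back the grand mean."""
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation between paired samples x and y."""
    a = double_center(pairwise_euclidean(x))
    b = double_center(pairwise_euclidean(y))
    dcov2 = (a * b).mean()    # squared sample distance covariance
    dvar_x = (a * a).mean()   # squared sample distance variance of x
    dvar_y = (b * b).mean()   # squared sample distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0            # convention when either sample is constant
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def classical_mds(d, k):
    """Embed a dissimilarity matrix d into k dimensions (assumed embedding).

    Negative eigenvalues, which arise for non-Euclidean dissimilarities,
    are clipped to zero, so the embedding is then only approximate.
    """
    d = np.asarray(d, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    gram = -0.5 * j @ (d ** 2) @ j            # implied inner products
    w, v = np.linalg.eigh(gram)
    top = np.argsort(w)[::-1][:k]             # k largest eigenvalues
    return v[:, top] * np.sqrt(np.clip(w[top], 0.0, None))
```

A typical use under these assumptions would be `distance_correlation(classical_mds(d, k), y)`, where `d` is an observed dissimilarity matrix and `y` holds responses for the same objects; a permutation test on that statistic would be one way to assess significance.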

This work greatly extends the set of practical tools available to statisticians, data analysts, and modelers in a wide variety of scientific fields for prediction, classification, and selection of important variables/patterns. The tools apply to data sets from small to large that include distance or dissimilarity information from a variety of structures, which are becoming increasingly available and important in practice. The resulting tools for improved statistical data analysis will be widely disseminated, and will benefit society to the extent that they aid researchers in extracting information from biological, medical, environmental, and other data sets that contain information of public interest. The project includes high-level training of a Ph.D. student in an important STEM area.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
1308877
Program Officer
Gabor Szekely
Project Start
Project End
Budget Start
2013-08-01
Budget End
2018-07-31
Support Year
Fiscal Year
2013
Total Cost
$400,002
Indirect Cost
Name
University of Wisconsin Madison
Department
Type
DUNS #
City
Madison
State
WI
Country
United States
Zip Code
53715