Clean data is a prerequisite for most statistical analyses. When questionable data items arise, the ideal solution is to go back and check the source. In many cases, however, this is not possible. Contamination is therefore an important problem, and robust techniques that can handle large data sets are needed to cope with it. The role of distances in the problem of robust model selection is examined. It is argued that robust model assessment and selection is an important problem that has not received adequate attention in the literature. It is suggested that distances offer potentially valuable tools for addressing various aspects of the modeling problem, one of which is robustness. A new framework is proposed that differs from the classical robustness paradigm in at least two respects. Most developments in classical robustness center on location-scale models and the concepts derived from them. Attempts to extend classical robust procedures to non-location-scale models have met with limited success. The methodology proposed here easily incorporates a wide variety of models, including location-scale models. The starting point of the new proposal is the identification of a goodness-of-fit measure that assesses whether a given model approximates the mechanism that generated the data. It is then examined in what sense the measure is robust.

Distances have been used extensively in many scientific fields, such as genetics, physics, sociology, and anthropology, and more recently in machine learning. The significance of this work is two-fold. Within statistics, it offers a very general framework that addresses the problem of robust model assessment and selection, can handle large data sets, and makes it possible to measure the extent to which the model approximates the phenomenon under study. Outside statistics, the technology and scientific results can be extended and applied to important problems in clinical informatics and bioequivalence.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0504957
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2005-07-01
Budget End: 2009-06-30
Support Year:
Fiscal Year: 2005
Total Cost: $250,001
Indirect Cost:
Name: Columbia University
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10027