This research concerns the development of innovative model comparison and model evaluation methods that focus on the most relevant features of the data and that are robust to deficiencies in both model and data. The advent of modern, automated technology for data collection and of cheap, near-boundless storage capacity provides access to a previously undreamt-of wealth of data. The parallel development of sophisticated models that combine many sources of information, together with the computational strategies and horsepower needed to fit them, would seem to enable near-perfect decision-making. However, the wealth of data aggravates the problems caused by data contamination, and the complexity of the models aggravates the difficulty of specifying a prior distribution on their parameters. Handling data contamination and constructing methods that are robust to the lack of prior information pose fundamental statistical challenges. In this project, the investigators illustrate the deficiencies of current leading model evaluation and model comparison methods, and then propose a set of new tools to alleviate these problems. The proposed research consists of two specific aims.

1. To develop reliable methods for Bayesian model comparison when prior information is lacking. The common practices of using an improper noninformative prior distribution or a vague proper prior distribution are effective for estimation, but they break down for Bayesian hypothesis testing, where model choice is sensitive to details of the prior distribution (see the illustrative sketch after this list). To tackle this difficulty, the investigators propose a remedy, the calibrated Bayes factor. The calibrated Bayes factor does not require extensive subjective elicitation, yields an analysis that better mimics the performance of the Bayes factor under a "reasonable default" prior, and is applicable to a wide variety of model comparison problems.

2. To develop robust methods for model evaluation and model fitting in the presence of contaminated data. Contaminated data come in many forms, including observations arising from recording mistakes or from irrelevant populations. The contaminating process may be unstable, which makes standard statistical modeling of the contamination infeasible. For model fitting, this project develops and implements restricted-likelihood methods, which lead to estimation strategies that focus on the most relevant features of the data and that are robust to "bad data". For model evaluation, this project develops an adaptive loss (scoring) paradigm for cross-validation, which produces robust results and yields superior finite-sample performance by stabilizing the evaluation.
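The prior sensitivity described in the first aim can be seen in a small numerical illustration. The sketch below is not the project's calibrated Bayes factor; it is a standard conjugate-normal calculation (assuming NumPy and SciPy, with simulated data and a hypothetical bf01 helper) showing the well-known Bartlett effect: as the prior scale tau under the alternative grows, the ordinary Bayes factor increasingly favors the null hypothesis regardless of the evidence in the data.

    import numpy as np
    from scipy.stats import norm

    # Illustrative sketch only: test H0: mu = 0 vs H1: mu ~ N(0, tau^2)
    # for Gaussian data with known sigma. The Bayes factor depends only on
    # the sample mean, and its value is driven by the prior scale tau.
    rng = np.random.default_rng(0)
    n, sigma, true_mu = 50, 1.0, 0.3
    y = rng.normal(true_mu, sigma, size=n)
    ybar = y.mean()

    def bf01(ybar, n, sigma, tau):
        """BF01 = m0(ybar) / m1(ybar) for the normal-mean point-null test."""
        m0 = norm.pdf(ybar, loc=0.0, scale=sigma / np.sqrt(n))           # marginal under H0
        m1 = norm.pdf(ybar, loc=0.0, scale=np.sqrt(sigma**2 / n + tau**2))  # marginal under H1
        return m0 / m1

    for tau in [0.5, 1.0, 10.0, 100.0, 1000.0]:
        print(f"tau = {tau:7.1f}   BF01 = {bf01(ybar, n, sigma, tau):10.3f}")
    # The evidence in the data is fixed, yet BF01 grows without bound as tau
    # increases -- the sensitivity to vague priors that motivates calibration.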

Model evaluation and model comparison are used daily in both scientific and corporate decision-making. These techniques help researchers judge which theory best describes a phenomenon, help health professionals identify which risk factors are related to disease incidence, and help corporate managers decide which business strategy results in increased sales or better customer retention. However, most current model evaluation and model comparison methods neglect deficiencies in the data or suffer from the lack of prior information on parameters. The proposed research provides powerful methodological tools for robust model comparison, model evaluation, and model fitting in these difficult situations. It can help practitioners in many fields better extract information from massive data sets and thereby improve their decision-making. Specific applications in health studies, psychological experiments, and machine learning will proceed alongside development of the new methodology. The general methodology is also applicable to many other scientific and technical areas, such as genomics, climatology, and economics, where large data sets are collected and robust model evaluation is desirable. The investigators are well positioned to disseminate the project's results. They have been actively involved in research groups at the intersection of Statistics and the social sciences, engineering/computer science, and marketing. They are also key members of a joint industry-university center dedicated to providing and disseminating research relevant to the insurance industry. Results from this project will be spread to other communities through the investigators' interactions with these groups.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1209194
Program Officer: Gabor Szekely
Budget Start: 2012-09-01
Budget End: 2016-08-31
Fiscal Year: 2012
Total Cost: $320,000
Name: Ohio State University
City: Columbus
State: OH
Country: United States
Zip Code: 43210