It is common statistical practice to employ a linear, least-squares analysis even when the assumptions justifying inference from such an analysis are not valid. The current goal is to provide a guide to useful inferential statements that can be justifiably made in the face of this common dilemma. For this purpose, begin with a nearly model-free description of the data-generating process: the observations are a random sample from a population consisting of vector-valued covariates and accompanying real-valued responses. Inference takes the form of a linear description of the relation between the response and the covariates. The target of inference is the suitably defined best linear description of the population dependence of the response on the covariates. A first task is to provide a mathematical framework that accurately describes such a situation and enables its rigorous analysis. Within this formulation, possible forms of inference can then be investigated. It can be shown that the conventional sample least-squares estimate of the linear coefficients has certain asymptotic optimality properties, but the conventional standard errors and confidence intervals for these coefficients are in general not asymptotically correct. Correct asymptotic inference is provided by suitable forms of either the bootstrap or the so-called sandwich estimator. The current research will discuss variations of these inferential procedures and describe situations in which they provide trustworthy results for realistic sample sizes. The model-free perspective leads to additional understanding of other important statistical settings involving possibly informative covariates. One of these relates to Randomized Clinical Trials in which interest centers on the Average Treatment Effect, compared with that of a placebo or an alternative standard treatment. The general formulation suggests use of a new estimator and related inference, and the properties and variants of this will be investigated.
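The contrast the abstract draws between conventional standard errors and the sandwich or bootstrap alternatives can be illustrated with a small simulation. The sketch below is not from the project itself; it is a minimal example, assuming a hypothetical nonlinear, heteroscedastic truth, in which the model-trusting OLS standard errors disagree with the heteroscedasticity-consistent "sandwich" estimator and with a pairs (x, y jointly resampled) bootstrap, which are the asymptotically correct choices under the model-free view.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-1, 1, n)
# Nonlinear mean and noise scale growing with |x|: both model assumptions fail.
y = x**2 + rng.normal(scale=0.5 + np.abs(x), size=n)
X = np.column_stack([np.ones(n), x])          # design matrix with intercept

def ols(X, y):
    """Least-squares coefficients via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

beta = ols(X, y)
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Conventional (model-trusting) covariance: sigma^2 * (X'X)^{-1}
conv_cov = (resid @ resid) / (n - 2) * XtX_inv
conv_se = np.sqrt(np.diag(conv_cov))

# Sandwich covariance: (X'X)^{-1} [sum_i e_i^2 x_i x_i'] (X'X)^{-1}
meat = (X * resid[:, None] ** 2).T @ X
sand_cov = XtX_inv @ meat @ XtX_inv
sand_se = np.sqrt(np.diag(sand_cov))

# Pairs bootstrap: resample (x_i, y_i) jointly, re-fit, take the empirical spread.
boot = np.array([ols(X[idx], y[idx])
                 for idx in (rng.integers(0, n, n) for _ in range(500))])
boot_se = boot.std(axis=0)

print("conventional SE:", conv_se)
print("sandwich SE:    ", sand_se)
print("bootstrap SE:   ", boot_se)
```

Under this data-generating process the sandwich and bootstrap slope errors agree with each other and exceed the conventional one, so confidence intervals built from the conventional formula would be too short.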

Statistical practice is built on and justified by corresponding statistical theory. That theory has a common overarching paradigm: statistical data are observed, a statistical description is adopted for the data, and the data are then analyzed according to that description. This paradigm presumes that the analytic model agrees sufficiently well with the actual model that generated the data, which is often not the case in practice. The current proposal builds a new, coherent theory that goes beyond the common paradigm in that it allows the statistical model for the data and the model for the analysis to be very different. The proposed research should first warn practitioners of often encountered but rarely recognized dangers inherent in the common practice of using linear models, such as regression analysis and ANOVA, when they may not sufficiently accurately represent the true nature of the statistical sample. It will then provide alternative forms of inference that are valid and can be responsibly utilized in such situations. To complement its theoretical, methodological orientation, the research maintains close connections with applications through the applied activities of several of the senior investigators in diverse areas, including social science (especially criminology), operations research, and health care.

Project Report

Statistical analyses are based on probabilistic models. These models all make assumptions about the nature of the data. Many applications of statistical methods involve models for the relationship of several quantities (sometimes called "covariates") to some outcome of interest. The covariates are the potential predictive variables (perhaps age, sex, height, socio-economic status, etc.). A goal of such analyses is to predict some numerical outcome variable (perhaps blood pressure or length of life) and, more generally, to understand the relation between the potential predictors and the dependent outcome variable. A common assumption in such regression models is that the true relation between predictors and outcomes is linear with an additional additive random residual term. Further, more technical assumptions are often also made on this residual term. (These are the familiar assumptions of "normality" and "homoscedasticity" of errors in basic regression models.) What happens if the analysis proceeds on this basis but one or more of these assumptions is not valid? A fundamental goal of this proposal was to answer this question by properly interpreting the resulting linear-model statement when the truth is not linear. In a wide variety of applications the data are observational, and the potential covariates are themselves random variables; this randomness is an important factor in our answer to the fundamental question. The basic elements needed for constructing the answer all exist in previous research and in appropriate elements of existing statistical practice, but they need to be carefully combined and used. A major product of our research is a paper that explains the basic background and the correct interpretation of linear analyses in the face of a non-linear reality (and of other realities that do not agree with the statistical model assumptions).
This paper also explains some additional steps that can be taken, which go beyond existing theory and methodology. The interpretations proposed in this basic paper lead to new perspectives on more specific types of statistical inference. One of these is the estimation of the average treatment effect in randomized clinical trials; a paper on this issue was completed and submitted. We have also capitalized on this re-interpretation and modified understanding of traditional methodology to begin investigating improved methodologies for other statistical settings, including applications under the umbrella of "big data" and "machine learning". This promising research has continued beyond the conclusion of the current proposal. The basic perspectives and paradigms we propose differ somewhat from those customarily taught in basic statistics courses. Because of this, we have been developing new educational materials designed to convey them at appropriate stages within the statistical curriculum.
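The "correct interpretation" the report refers to can be made concrete: even when the true regression function is nonlinear, OLS consistently estimates the population best linear approximation, i.e. the minimizer of E[(Y - b0 - b1 X)^2]. The following sketch is a hypothetical illustration (the exponential truth and uniform covariate are illustrative assumptions, not from the project): the fitted coefficients at large n match the population projection computed directly from moments.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.uniform(0, 1, n)
y = np.exp(x) + rng.normal(size=n)      # true mean function exp(x), not linear

# OLS fit of the (misspecified) linear description
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Population target beta* solves E[Z Z'] beta = E[Z Y] with Z = (1, X).
# For X ~ U(0,1): E[X] = 1/2, E[X^2] = 1/3, E[e^X] = e - 1,
# and E[X e^X] = 1 (integration by parts).
M = np.array([[1.0, 0.5],
              [0.5, 1.0 / 3.0]])
v = np.array([np.e - 1.0, 1.0])
beta_star = np.linalg.solve(M, v)

print("OLS estimate:     ", beta_hat)
print("population target:", beta_star)   # the two agree for large n
```

The point is that beta_hat converges to beta_star regardless of the nonlinearity; what breaks down under misspecification is not the estimand or the estimator but the conventional standard errors attached to it.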

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
1310795
Program Officer
Gabor Szekely
Budget Start
2013-09-15
Budget End
2014-08-31
Fiscal Year
2013
Total Cost
$199,395
Name
University of Pennsylvania
City
Philadelphia
State
PA
Country
United States
Zip Code
19104