It is common statistical practice to employ a linear least-squares analysis even when the assumptions that justify inference from such an analysis do not hold. The current goal is to provide a guide to inferential statements that remain justified in the face of this common dilemma. For this purpose, begin with a nearly model-free description of the data-generating process: the observations are a random sample from a population consisting of vector-valued covariates and accompanying real-valued responses. Inference takes the form of a linear description of the relation between the response and the covariates. The target of inference is the suitably defined best linear description of the population dependence of the response on the covariates. A first task is to provide a mathematical framework that accurately describes such a situation and enables its rigorous analysis. Within this formulation, possible forms of inference can then be investigated. It can be shown that the conventional sample least-squares estimate of the linear coefficients has certain asymptotic optimality properties. However, the conventional standard errors and confidence intervals for these coefficients are in general not asymptotically correct. Correct asymptotic inference is provided by suitable forms of either the bootstrap or the so-called sandwich estimator. The current research will discuss variations of these inferential procedures. It will also describe situations in which these procedures provide trustworthy results for realistic sample sizes. The model-free perspective leads to additional understanding of other important statistical settings involving possibly informative covariates. One of these relates to Randomized Clinical Trials, in which interest centers on the Average Treatment Effect relative to a placebo or an alternate, standard treatment. The general formulation suggests use of a new estimator and related inference, and the properties and variants of this estimator will be investigated.
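
The contrast between conventional and sandwich standard errors can be illustrated with a small simulation. The following is a minimal sketch, not the estimator or procedure developed in the proposed research: it assumes NumPy is available, uses the basic (HC0) form of the sandwich estimator and a pairs bootstrap that resamples whole (covariate, response) observations, and the function names `ols_with_sandwich` and `pairs_bootstrap_se`, the simulated population, and all tuning values are hypothetical choices made for illustration.

```python
import numpy as np


def ols_with_sandwich(X, y):
    """Return OLS coefficients, conventional SEs, and sandwich (HC0) SEs."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Conventional covariance: assumes a correct linear mean and constant variance.
    sigma2 = resid @ resid / (n - p)
    se_conv = np.sqrt(np.diag(sigma2 * XtX_inv))

    # Sandwich covariance: (X'X)^{-1} [sum_i r_i^2 x_i x_i'] (X'X)^{-1}.
    meat = (X * resid[:, None] ** 2).T @ X
    se_sand = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_conv, se_sand


def pairs_bootstrap_se(X, y, n_boot=500, seed=0):
    """Bootstrap SEs obtained by resampling rows (x_i, y_i) with replacement."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    draws = np.empty((n_boot, p))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    return draws.std(axis=0, ddof=1)


# Hypothetical population: the mean is nonlinear in x and the noise scale
# grows with x, so the linear working model is wrong in both respects.
# For x uniform on (0, 3), the population best linear slope is
# Cov(x, x^2) / Var(x) = 3.
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0.0, 3.0, size=n)
y = x**2 + x * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

beta, se_conv, se_sand = ols_with_sandwich(X, y)
se_boot = pairs_bootstrap_se(X, y)

print("coefficients:    ", beta)
print("conventional SE: ", se_conv)
print("sandwich SE:     ", se_sand)
print("pairs bootstrap: ", se_boot)
```

In a setting like this one, the sandwich and pairs-bootstrap standard errors should be close to each other, while the conventional standard errors need not agree with either, since they rely on assumptions the simulated population violates.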

Statistical practice is built on and justified by corresponding statistical theory. That theory has a common overarching paradigm: statistical data are observed, a statistical description is adopted for the data, and the data are then analyzed according to this description. The paradigm presumes that the analytic model agrees sufficiently well with the actual model that generated the data. This is often not the case in practice. The current proposal builds a new, coherent theory that goes beyond the common paradigm in that it allows the statistical model for the data and the model for the analysis to be very different. The proposed research should first warn practitioners of often encountered but rarely recognized dangers. These are inherent in the common practice of using linear models, such as regression analysis and ANOVA, when they may not represent the true nature of the statistical sample with sufficient accuracy. The research will then provide alternate forms of inference that are valid and can be responsibly used in such situations. To complement its theoretical and methodological orientation, the research maintains close connections with applications through the applied activities of several of the senior investigators in diverse areas, including social science (especially criminology), operations research, and health care.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Standard Grant (Standard)
Program Officer
Gabor Szekely
University of North Carolina Chapel Hill
Chapel Hill
United States