Technical Description: Contemporary statistics deals with increasingly large and complex collections of data. Complementary styles are emerging for appropriately interpreting such data. One general style involves constructing multi-faceted models for the data. These should be built from relatively simple components that treat separate aspects of the situation, with additional modeling parameters at higher levels of the structure. The higher-level parameters provide connections among the components, and they also provide flexibility and a measure of robustness against the possibility that the model is overly restrictive or partially inappropriate. These additional higher-order quantities may be described numerically, qualitatively, or as unknown functions, depending on the context. They have various names, such as hyperparameters or latent variables, but irrespective of the terminology and of their subjective interpretation they serve comparable roles in the analysis of the data. Adding such higher-order quantities to component models has the effect of shrinking estimates and other forms of inference toward global averages or patterns. The major thrust of the proposal is to understand the shrinkage phenomenon from a more fundamental, componentwise perspective. Charles Stein's discovery of the advantages of shrinkage in the estimation of independent normal means is among the most surprising and important statistical developments of the preceding century. As the proposal emphasizes, this discovery can be interpreted in exactly the framework of independent pieces tied together by higher-level quantities. A great deal of theory has been developed over the past 50 years to rigorously understand the consequences of dealing with such a connection, and of using certain structurally appealing techniques such as those labeled "random-effects models" or hierarchical or empirical objective Bayes analyses.
However, this theory has failed to adequately address some issues that must be understood before it can be properly applied in modern complex settings. For example, classical theory has tended to break groups of parameters into separate blocks and, at best, to shrink separately within each block. It will be shown how additional shrinkage across blocks can be beneficial, and further research is proposed in this regard. Another classical deficiency that this proposal aims to correct is that the current theory of shrinkage is relatively incomplete and somewhat unsatisfactory for unbalanced data situations involving unequal sampling variances (heteroscedasticity), yet such situations are typical of highly complex modern data.
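
As an illustration of the shrinkage phenomenon described above, the following is a minimal sketch of the classical positive-part James-Stein estimator that shrinks independent normal observations toward their grand mean. It is not taken from the proposal: the function names, the simulation settings, and the assumption of a known, common sampling variance `sigma2` are all illustrative choices. In particular, the equal-variance assumption is exactly the balanced case; the heteroscedastic situations that the proposal targets are not handled here.

```python
import random


def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimate, shrinking toward the grand mean.

    x      : list of observations, x[i] ~ N(theta[i], sigma2), independent
    sigma2 : known common sampling variance (an assumption of this sketch)
    """
    p = len(x)
    if p < 4:
        raise ValueError("shrinkage toward the grand mean needs p >= 4")
    grand_mean = sum(x) / p
    spread = sum((xi - grand_mean) ** 2 for xi in x)
    if spread == 0.0:
        return [grand_mean] * p
    # Shrinkage factor, clipped at 0 so estimates never cross the grand mean.
    factor = max(0.0, 1.0 - (p - 3) * sigma2 / spread)
    return [grand_mean + factor * (xi - grand_mean) for xi in x]


def total_squared_error(estimates, truth):
    """Sum of squared errors over the whole collection of means."""
    return sum((e - t) ** 2 for e, t in zip(estimates, truth))


def simulate(p=50, reps=2000, seed=0):
    """Average total squared error: raw observations vs. James-Stein."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(p)]  # hypothetical true means
    raw_loss = js_loss = 0.0
    for _ in range(reps):
        x = [rng.gauss(t, 1.0) for t in theta]  # unit sampling variance
        raw_loss += total_squared_error(x, theta)
        js_loss += total_squared_error(james_stein(x), theta)
    return raw_loss / reps, js_loss / reps


if __name__ == "__main__":
    raw, js = simulate()
    print(f"raw: {raw:.2f}  James-Stein: {js:.2f}")
```

Running the simulation compares the average total squared error of the unshrunk observations with that of the shrunken estimates. The loss is summed over all components, so the comparison is about the collection of means as a whole rather than any single mean.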

General Description: Modern statistical applications often involve massive amounts of data. Organizing and interpreting such large data sets conceptually is a fundamental challenge. It often involves modeling the data as being probabilistically dependent on parameters that control the process being investigated. Classical statistical formulations typically view individual parameters as the primitive structural quantities, rather than taking as primitives ensembles of parameters whose joint modeling characteristics are well understood and controlled. This proposal introduces the ensemble-risk to better address this issue and suggests judging estimators according to their performance relative to this ensemble-risk. Applications of the theory and methodology to be developed in this proposal span nearly all areas of science and technology, but principal applications can be identified in areas of the physical and biological sciences such as genomics, climatology and astronomy, where large data sets and ensembles of related parameters appear naturally. As a further instance of the range of potential applications, the orientation and conceptualization in this proposal derive in part from previous data modeling of telephone call-center traffic and of internet traffic and intrusions, and a portion of the current proposal involves modeling in a different, complex context: forecasting housing prices.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 0707033
Program Officer: Gabor J. Szekely
Budget Start: 2007-07-01
Budget End: 2011-06-30
Fiscal Year: 2007
Total Cost: $389,440
Name: University of Pennsylvania
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104