The analysis of massive data sets, now common in scientific investigations, poses many statistical challenges not present in smaller-scale studies. Many of these data sets are sparse: most of the data correspond to noise and only a small fraction is of interest. In such situations, mixture models can provide an effective and convenient framework for a wide variety of problems. The investigators propose to develop a comprehensive methodology for mixture models in sparse settings. Four main goals will be pursued in moderately sparse and super-sparse environments. The first is to make precise how well sparsity can be estimated and to develop a general methodology for estimating it. The second is to develop a data-dependent thresholding rule that, with high probability, yields a collection of cases almost all of which correspond to signal; the investigators also plan to develop a theory that makes precise the possible tradeoffs between discovering signal and including noise. The third is to develop a theory of optimal detection for sparse mixture models. The final goal is to provide a theoretical basis for connecting mixture models with sequence models, another useful framework for analyzing sparse data.

The proposed research on sparse inference will provide technical tools and methodology to researchers in other scientific fields who collect and analyze large data sets with sparse signals. These fields include astronomy, bioinformatics, biostatistics, and genetics. The procedures and algorithms will be implemented in S-PLUS or MATLAB and made available on the Internet, along with the associated research reports, to facilitate comparisons with other approaches.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0604954
Program Officer: Gabor J. Szekely
Budget Start: 2006-09-01
Budget End: 2009-08-31
Fiscal Year: 2006
Total Cost: $353,380

Name: University of Pennsylvania
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104