The need to analyze multivariate data arises in many diverse disciplines, such as computer science and engineering, signal processing, psychology, meteorology, chemometrics, sociology, and biology. Owing to changes in the methods for collecting data, the number of variables or attributes measured in a single observation or for a single subject is becoming exceptionally large, and it can be considerably larger than the number of observations or subjects themselves. Such data sets are commonly referred to as large sparse data sets. For such data sets, recording bad data points or outliers is increasingly likely. Outliers tend to have a disproportionate impact on the interpretation of the data unless one uses robust methods, that is, methods that can accommodate bad data. Developing methods to analyze large sparse data sets has become a major research topic within the field of statistics. Relatively little attention has been given, however, to the development of robust methods for large sparse data sets, which is the primary goal of this research project. The research project aims to produce fundamental results, theoretical approaches, and statistical methods applicable to the robust analysis of large sparse data sets, upon which other researchers can build.
Most robust multivariate statistical methods are applicable mainly when the sample size is considerably larger than the number of variables, and are not particularly applicable to large sparse data sets. In particular, for sample sizes that are modest relative to the number of variables, robust affine equivariant estimates of multivariate location and scatter are similar in performance to the classical sample mean vector and sample covariance matrix, and consequently do not yield robust results for such data sets. Analyzing relatively sparse multivariate data tends to require either presuming certain covariance structures, such as those arising in graphical models, factor analysis, or other reduced rank models, or developing methods that give preference to certain covariance structures via regularization. These special covariance structures are usually not considered in robust multivariate methods. To address this shortcoming, the research project aims to develop robust methods that take into account a presumed covariance structure, and in particular to develop and study direct M-estimation methods and S-estimation methods for structured covariance models, as well as to develop and study penalized M-estimates of the covariance matrix. Addressing robustness issues for structured covariance models and for penalization methods is a fundamental problem, and it is more mathematically and computationally challenging than in the classical setting or in the unrestricted robust estimation setting. Here, some recent work on geodesic convexity within the signal processing community is expected to play an important role in addressing these problems.