The advent of computers routinely brings large, even very large, datasets for analysis. How to analyze these data, and how to glean useful information from them, is anything but routine. This proposal investigates two broad areas, viz., mixture decomposition, and regression in the presence of taxonomy and hierarchical variables. The mixture decomposition work is concerned with data in which each observation is a distribution function in the p-fold Cartesian product of distribution spaces, rather than the single point in p-dimensional space of classical data. Such data arise naturally, or after aggregation of the original data to a more manageable size that still retains their inherent information. A goal is to partition these distributions into coherent classes and to estimate the relevant class distributions. It is proposed to adapt ideas from copula theory, developed for classical data, to data comprised of distributions. Embedded in this process is the need to study parameter estimation. Different partitioning techniques will be explored, such as dynamical clustering; different measures of fit will be considered, e.g., log-likelihood classification criteria; and different estimation methods will be explored for the underlying copulas and the associated distributions and their parameters, e.g., maximum likelihood, and nonparametric methods such as Parzen's truncated window. The resulting methodology will have wide applicability to datasets generated in, e.g., meteorology, environmental science, the social sciences, health-care programs, and the like. Regression methods for the cases where taxonomy variables, and where hierarchical variables, are present will also be developed. This will first be done for classical data, and then extended to interval-valued data and to histogram- (or frequency-) valued data.
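The proposal does not fix a specific algorithm, but the partitioning step can be illustrated with a minimal sketch. Here, each distribution-valued observation is a one-dimensional sample represented by a fixed grid of quantiles, and the distributions are grouped by a k-means-style scheme that alternates an allocation step and a representation step, in the spirit of dynamical clustering. All function names and the data are invented for illustration; the Euclidean distance between quantile vectors, which approximates the 2-Wasserstein distance, stands in for whatever fit criterion the research ultimately adopts.

```python
# Illustrative sketch (not the proposal's method): partition distribution-valued
# observations into classes. Each observation is a 1-D sample, represented by m
# equally spaced quantiles, so Euclidean distance between representations
# approximates the 2-Wasserstein distance between the distributions.

import random

def quantile_vector(sample, m=20):
    """Represent a sample by m equally spaced quantiles of its sorted values."""
    s = sorted(sample)
    n = len(s)
    return [s[min(n - 1, int(k * n / m))] for k in range(m)]

def dist2(u, v):
    """Squared Euclidean distance between two quantile vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cluster_distributions(samples, k=2, iters=25):
    """Dynamical-clustering-style k-means: alternate allocating each
    distribution to its nearest prototype and recomputing prototypes."""
    reps = [quantile_vector(s) for s in samples]
    # Simple deterministic initialization, adequate for this sketch.
    protos = [list(reps[0]), list(reps[-1])][:k]
    labels = [0] * len(reps)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(r, protos[j])) for r in reps]
        for j in range(k):
            members = [r for r, l in zip(reps, labels) if l == j]
            if members:
                protos[j] = [sum(col) / len(col) for col in zip(*members)]
    return labels

# Demo: two latent classes of distributions, N(0,1)-like and N(5,1)-like samples.
rng = random.Random(1)
group_a = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(5)]
group_b = [[rng.gauss(5, 1) for _ in range(200)] for _ in range(5)]
labels = cluster_distributions(group_a + group_b, k=2)
print(labels)  # → [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The quantile representation is one of several possible choices; a likelihood-based allocation (as in log-likelihood classification) or a copula-based model for multivariate distributions would replace `dist2` with a model-based fit criterion.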

With modern computers generating very large datasets, it is imperative that techniques be developed for analyzing them. To date, very few such methods exist. This research will develop new methodologies for analyzing these data. A first step is to aggregate the data in some well-defined but meaningful way. This aggregation produces data in the form of lists, intervals, or distributions, and these data now carry some form of internal structure. The research will focus on two types of such data. In the first, the data consist of distributions, and the goal is to develop methods to identify the appropriate mixture of distributions that describes them. The second deals with taxonomy and hierarchical data, with the goal of establishing regression relationships that explain the underlying process governing the variables involved. The resulting methodologies will allow analysis and interpretation of contemporary datasets for which no methods currently exist.
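To make the taxonomy-regression idea concrete, the following sketch fits a response to a tree-structured categorical predictor by computing group means at two levels of an invented taxonomy (species nested within class), a one-way ANOVA-style fit per level. The taxonomy, the data, and all names are fabricated for illustration; the proposal's actual regression methods for taxonomy and hierarchical variables are yet to be developed.

```python
# Illustrative sketch (not the proposal's method): regress a response on a
# taxonomy variable by fitting group means at each level of the tree.
# Coarser levels give simpler models; finer levels never fit worse.
# All data and names below are invented.

taxonomy = {"dog": "mammal", "cat": "mammal", "sparrow": "bird", "crow": "bird"}

data = [  # (leaf category, response)
    ("dog", 10.2), ("dog", 9.8), ("cat", 7.9), ("cat", 8.1),
    ("sparrow", 2.1), ("sparrow", 1.9), ("crow", 3.0), ("crow", 3.2),
]

def r_squared(pairs):
    """Fit group means to the (label, y) pairs; return the R^2 of that fit."""
    ys = [y for _, y in pairs]
    grand = sum(ys) / len(ys)
    groups = {}
    for label, y in pairs:
        groups.setdefault(label, []).append(y)
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    sse = sum((y - means[g]) ** 2 for g, y in pairs)   # residual sum of squares
    sst = sum((y - grand) ** 2 for y in ys)            # total sum of squares
    return 1 - sse / sst

fine = r_squared(data)                                   # leaf level (species)
coarse = r_squared([(taxonomy[g], y) for g, y in data])  # parent level (class)
print(round(coarse, 3), round(fine, 3))
```

Choosing the taxonomy level at which to fit trades parsimony against fit; extending such fits from classical point data to interval- and histogram-valued responses is part of the proposed work.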

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0400584
Program Officer: Gabor J. Szekely
Budget Start: 2004-05-15
Budget End: 2009-04-30
Fiscal Year: 2004
Total Cost: $218,297
Name: University of Georgia
City: Athens
State: GA
Country: United States
Zip Code: 30602