Datasets that are complex (with the data themselves "complex" and/or with structures that impose complications) are becoming more and more routine with the impact of contemporary computer capacity. What is not routine is how to analyse these data. Indeed, data "collection" is fast outpacing the ability to analyse the data. It is evident that, even in those situations where in theory available methodology might seem to apply, routine use of such statistical techniques is often inappropriate. Some methods (e.g., squashing) take representative "samples" and then use standard procedures on the sampled data. Others seek sub-patterns (e.g., data mining) and then try to focus on the data behind those patterns. Others aggregate the data in some meaningful way. One such aggregation method produces so-called symbolic data (such as lists, intervals, distributions, etc.). An advantage of symbolic data is that, unlike a sampled set, a symbolic value retains all the original data while simultaneously reducing the size of the dataset. Further, while the massive datasets encountered today are one source of symbolic data, there are many data that are naturally symbolic (be these small or large datasets). All are better analysed by methods developed for symbolic data. The investigator addresses three major areas. The first is classification trees: distance measures for interval- and histogram-valued data are developed and then used in new algorithms that extend the classical CART methodology to symbolic data. The second is regression methods: in particular, logistic regression and Cox's proportional hazards model are adapted to symbolic data. The third is factor analysis and principal component methodology for symbolic data.
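To make the distance idea concrete, the minimal Python sketch below computes the Hausdorff distance between two interval-valued observations, one standard choice for interval data; the project develops its own distance measures for intervals and histograms, which need not coincide with this illustrative one, and the function name and data below are hypothetical.

def hausdorff_interval_distance(interval_a, interval_b):
    """Hausdorff distance between intervals [a_lo, a_hi] and [b_lo, b_hi]:
    max(|a_lo - b_lo|, |a_hi - b_hi|)."""
    (a_lo, a_hi), (b_lo, b_hi) = interval_a, interval_b
    return max(abs(a_lo - b_lo), abs(a_hi - b_hi))

# Two interval-valued observations, e.g. daily temperature ranges (hypothetical values)
x = (12.0, 21.0)
y = (15.0, 27.0)
print(hausdorff_interval_distance(x, y))  # 6.0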

With the impact of contemporary computer capacity, datasets that are complex (with the data themselves "complex") are becoming more ubiquitous. Yet those same computers often lack the capacity to analyse these massive datasets, so new ways to handle them must be developed. One way is to aggregate the data in a scientifically meaningful way (with the actual aggregation being dictated by the question at hand). Such aggregation will necessarily produce data that form lists, intervals, histograms, etc. The investigator develops new methodologies for interval data in three major areas: classification trees (after first finding distance measures for intervals and histograms), regression methods (especially logistic regression), and factor analysis. The results are applied to data. A synergism is achieved by integrating the mathematical, statistical, and computational arenas to address real issues raised by contemporary datasets; the outcomes cannot be achieved with the tools of just one of these disciplines but require all three. The new methodologies will have wide applicability to datasets generated in, e.g., meteorology, environmental science, the social sciences, health-care programs, industry, and the like, well beyond those motivating the work, and so will have a broad impact on US science. Further, since doctoral students will be engaged as collaborators and international researchers will be active participants, the research will help internationalize the next and future generations of US scientists.
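As an illustration of the aggregation step just described, the Python sketch below collapses classical point observations into interval-valued symbolic observations by taking the minimum and maximum within each group; the grouping variable, field names, and readings are hypothetical, and the project's aggregations are dictated by the scientific question rather than by this simple min/max rule.

from collections import defaultdict

def aggregate_to_intervals(records, key, value):
    """Collapse classical (category, value) records into one interval per category."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])
    return {cat: (min(vals), max(vals)) for cat, vals in groups.items()}

# Hypothetical readings: individual observations aggregated to one interval per city
readings = [
    {"city": "Athens", "temp": 18.2},
    {"city": "Athens", "temp": 27.5},
    {"city": "Atlanta", "temp": 20.1},
    {"city": "Atlanta", "temp": 31.0},
]
print(aggregate_to_intervals(readings, "city", "temp"))
# {'Athens': (18.2, 27.5), 'Atlanta': (20.1, 31.0)}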

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0805245
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2008-08-01
Budget End: 2012-07-31
Support Year:
Fiscal Year: 2008
Total Cost: $150,000
Indirect Cost:
Name: University of Georgia
Department:
Type:
DUNS #:
City: Athens
State: GA
Country: United States
Zip Code: 30602