The analysis of social science data is often difficult for reasons that affect other fields less severely. One problem that is particularly hard to handle with traditional statistical models is deliberately withheld information that correlates strongly with the phenomena of interest. Such information can be thought of as unobserved clustering in the data. This project will substantially improve the current state of model-based clustering algorithms using Generalized Linear Mixed Dirichlet Models (GLMDM). The investigators' key objectives are to: (1) better understand the unobserved clustering effects that are pervasive in social science datasets, notably in empirical studies of terrorism; (2) adapt GLMDM algorithms to provide substantive clusters of interest through posterior probabilities using covariate information; (3) develop an algorithmic approach that incorporates variable selection within clusters directly into a general clustering model; (4) speed up the simultaneous clustering and variable selection process through parallelization; and (5) distribute this technology as an easy-to-use R package.

This project will establish a new approach for using Bayesian nonparametric methods to produce clustering based on posterior probabilities. The development of nonparametric clustering algorithms is expected to substantially improve the current state of data clustering. The algorithmic developments, which will be disseminated widely, can be applied in any scientific field and will contribute to the statistical literature on Markov chain Monte Carlo. This new approach will be applied to the empirical study of terrorism. The project will also aid the intellectual development of students and a postdoctoral researcher, who will benefit from the project's interdisciplinary focus.
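
As a rough illustration of what clustering under a Dirichlet process prior means, the sketch below draws a random partition from the Chinese restaurant process, the distribution over groupings that a Dirichlet process induces. It is not the GLMDM algorithm and is not taken from the project's R package; the function names, the `concentration` parameter name, and the choice of Python are assumptions made only for this example. A full model would combine such a prior with a likelihood and covariates to obtain posterior cluster probabilities.

```python
import random


def sample_crp_partition(n_items, concentration, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Hypothetical helper for illustration only; not part of any GLMDM software.
    """
    labels = []
    sizes = []  # sizes[k] = number of items currently assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is opened with weight `concentration`.
        weights = sizes + [concentration]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(sizes):
            sizes.append(1)  # open a new cluster
        else:
            sizes[k] += 1  # join an existing cluster
        labels.append(k)
    return labels


def expected_num_clusters(n_items, concentration):
    """Prior expected number of clusters: sum over i of a / (a + i), i = 0..n-1."""
    return sum(concentration / (concentration + i) for i in range(n_items))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the draw is reproducible
    labels = sample_crp_partition(50, concentration=1.0, rng=rng)
    print("clusters in this draw:", len(set(labels)))
    print("prior expected clusters:", expected_num_clusters(50, 1.0))
```

The number of clusters is not fixed in advance: larger values of `concentration` favor more, smaller clusters, which is the property that makes this family of priors suited to data with an unknown number of unobserved groups.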

Project Report

A challenge often encountered in social science studies is that information strongly correlated with the phenomena of interest is frequently deliberately withheld, unmeasurable, or non-collectable. The missing information degrades models of those phenomena because the unmeasured explanatory factors still affect the modeled relationship. Thus, the models may omit influential factors, and the estimates may be biased because this unobserved clustering is not accounted for. The study of terrorism has been challenging from an empirical perspective due to inherent problems in the available data. Yet terrorism is an important problem because it affects internal government policy, public perception, relations between states, and, of course, personal safety. Data challenges include the fact that successful acts of terrorism, being visible events, are more likely to be recorded than unsuccessful acts; insufficient explanatory variables (information on factors related to terrorism); and lack of access to classified collections. Another key problem is that unmeasured clusters, such as regional clusters, are present in almost all terrorism data. In this work, a general clustering model and an accompanying algorithm were developed for analyzing such data; the model directly accounts for differing variable effects within clusters. Empirical studies of terrorism were used to better understand unobserved clustering effects while incorporating a nonparametric error structure. The new method is able to identify unobserved clusters when other methods fail. To facilitate the use of these models, a software package in R has been developed and will be distributed freely.

Agency: National Science Foundation (NSF)
Institute: Division of Social and Economic Sciences (SES)
Application #: 1028329
Program Officer: Cheryl Eavey
Project Start:
Project End:
Budget Start: 2010-10-01
Budget End: 2014-09-30
Support Year:
Fiscal Year: 2010
Total Cost: $162,509
Indirect Cost:
Name: University of Florida
Department:
Type:
DUNS #:
City: Gainesville
State: FL
Country: United States
Zip Code: 32611