The analysis of social science data is often difficult for reasons that tend to affect other fields less substantially. One problem that is particularly difficult to handle with traditional statistical models is deliberately withheld information that correlates strongly with phenomena of interest. Such information can be thought of as unobserved clustering in the data. This project will substantially improve the current state of model-based clustering algorithms using Generalized Linear Mixed Dirichlet Models (GLMDM). The investigators' key objectives are to: (1) better understand unobserved clustering effects that are pervasive in social science datasets, notably in empirical studies of terrorism; (2) adapt GLMDM algorithms to provide substantive clusters of interest through posterior probabilities using covariate information; (3) develop an algorithmic approach that incorporates variable selection within clusters directly into a general clustering model; (4) speed up the simultaneous clustering and variable selection process through parallelization; and (5) distribute this technology as an easy-to-use R package for general use by others.

This project will establish a new approach for using Bayesian nonparametric methods to produce clustering based on posterior probabilities. The development of nonparametric clustering algorithms is expected to substantially improve the current state of data clustering. The algorithmic developments, which will be disseminated widely, can be applied in any scientific field and will contribute to the statistical literature on Markov chain Monte Carlo. This new approach will be applied to the empirical study of terrorism. The project will also aid in the intellectual development of students and a postdoctoral researcher, who will benefit from the project's interdisciplinary focus.

Project Report

The analysis of social science data is often difficult for reasons that tend to affect other fields less substantially. Such problems include: high levels of measurement error, governments that falsify or withhold information, collection in difficult or even violent areas, embargoed information based on privacy concerns, well-known survey response issues, overlapping explanatory power in model variables, the fluidity of political and social institutions, as well as the willingness of individuals to conceal information from researchers. These challenges have led to many important modeling innovations. Here we are concerned with deliberately withheld information that correlates strongly with some phenomena in the model, although the same problem can arise from mis-coded data. This creates biased inferences with standard approaches and leads to erroneous conclusions about the key explanations of interest. In our case, we are concerned with perhaps the worst instance of this problem: studying terrorist groups who obfuscate, deceive, and even kill, as a means of denying observers reliable data.

An important part of this problem is latent heterogeneity that arises from unknown grouping in the data, and several methods have been developed to produce reliable models to account for such unknown clustering. This work introduces a new model-based clustering design which incorporates two sources of heterogeneity for the modeling of social science and biomedical data. The first source of heterogeneity is in the residuals from the mean structure and is modeled with Dirichlet Process random effects. The second source of heterogeneity is unobserved grouping in the data, which is modeled using the product partition framework. Incorporating both sources of heterogeneity allows the model to capture both structural differences in response to the covariates and departures from normality in the error structure. The model is applied to the analysis of terrorist groups, showing how this tool reveals important features in a dataset that are otherwise undetectable.
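
As a rough illustration of how a Dirichlet Process prior induces unobserved groupings, the following Python sketch draws cluster labels from a Chinese restaurant process, the partition distribution implied by a Dirichlet Process. It is not the investigators' GLMDM algorithm or their product partition model. The function name `crp_partition` and the parameters `n_items`, `alpha`, and `seed` are hypothetical names chosen for this example only.

```python
import random


def crp_partition(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Illustrative only: alpha (> 0) is the concentration parameter; larger
    values tend to produce more clusters.
    """
    rng = random.Random(seed)
    labels = []
    sizes = []  # sizes[k] = number of items currently assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight sizes[k];
        # a brand-new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)   # open a new cluster
        else:
            sizes[k] += 1     # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    # The number of clusters is not fixed in advance; it emerges from the draw.
    labels = crp_partition(n_items=20, alpha=1.0, seed=1)
    print(labels)
```

In a full model, such a partition would be sampled jointly with the regression parameters by Markov chain Monte Carlo, rather than drawn once from the prior as it is here.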

Agency: National Science Foundation (NSF)
Institute: Division of Social and Economic Sciences (SES)
Application #: 1028314
Program Officer: Cheryl Eavey
Project Start:
Project End:
Budget Start: 2010-10-01
Budget End: 2014-09-30
Support Year:
Fiscal Year: 2010
Total Cost: $162,497
Indirect Cost:
Name: Washington University
Department:
Type:
DUNS #:
City: Saint Louis
State: MO
Country: United States
Zip Code: 63130