This research project deals with two crucial aspects of working with large sparse contingency tables: protecting the confidentiality of responses when data are shared with other researchers, and the implications of sparsity for maximum likelihood estimation in log-linear models. The first problem entails the evaluation of the disclosure risk associated with the partial release of information from a classified database, e.g., in the form of marginal tables involving subsets of variables. The second problem is concerned with developing general-purpose inferential methodologies for model selection and estimation/testing in log-linear model analysis that are appropriate for sparse categorical data. The links between these seemingly separate problems emanate from the common statistical and mathematical formalism of algebraic statistics. This research will produce new computational algorithms and sharable computer code for use by behavioral and social science researchers, as well as foundational methods and theory linking the problems of cell estimation using maximum likelihood and log-linear models and confidentiality protection. The expected outcomes of this activity will include: (1) more effective inferential procedures for the quantitative analysis and interpretation of behavioral and social science data and for the determination of the risk of disclosure; (2) statistical software for the analysis of categorical data targeted at a large audience of practitioners and researchers, which will be developed and freely distributed in the form of both computer source codes and modular, executable files; (3) more efficient numerical procedures for assessing the disclosure risk associated with the release of marginal totals.
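As a rough illustration of why released marginal totals can carry disclosure risk, the sketch below computes the classical Fréchet bounds on each cell of a two-way table from its row and column totals alone. It is a minimal example, not the project's methodology: the function name `frechet_bounds` and the sample totals are hypothetical, and the higher-dimensional tables and overlapping margins that the project targets need more elaborate bounding procedures.

```python
def frechet_bounds(row_totals, col_totals):
    """Return (lower, upper) bounds for every cell of a two-way table.

    Only the released margins are used. For cell (i, j) with row total r,
    column total c and grand total n, the Frechet bounds are
        max(0, r + c - n) <= n_ij <= min(r, c).
    """
    n = sum(row_totals)
    if n != sum(col_totals):
        raise ValueError("row and column totals must share one grand total")
    return [
        [(max(0, r + c - n), min(r, c)) for c in col_totals]
        for r in row_totals
    ]


# Hypothetical margins for a 2 x 2 table with 10 respondents.
bounds = frechet_bounds([1, 9], [2, 8])
for row in bounds:
    print(row)
```

A narrow interval around a small count means that an intruder who sees only the margins can nearly pin down that cell, which is the kind of risk the abstract refers to.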

Log-linear model analysis forms a well-established and powerful set of statistical tools for the study of categorical data, especially in the form of multi-dimensional cross-classifications or multi-way contingency tables. These models have proved to be essential for the analysis of data emanating from many areas of the social and behavioral sciences, as well as in other scientific areas. For example, in a typical sample survey, data are generated for several thousand individuals on a large number of categorical variables, measuring such information as employment, income, health status, etc. The resulting cross-classification of these variables is large, i.e., involving many thousands of cells, and sparse, i.e., most of the cell entries are either very small or contain zero counts. Similar problems arise in the study of social networks, in public health and medicine, and in the analysis of genetics databases. Recent developments in the mathematical area of algebraic geometry have provided a novel and powerful formalism for the representation of log-linear models relevant for such contingency table data. This project will use this mathematical formalism to focus on two different aspects of large sparse contingency tables: (1) protecting the privacy of the data providers when data are shared with other users, while at the same time (2) ensuring that such tables are useful for statistical analysis by developing new methods for log-linear model computation. The results of the project will improve access to data for secondary analysis and enhance the capacity of researchers and analysts to exploit the information in large sparse databases.
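To make the effect of sparsity on maximum likelihood estimation concrete, the following sketch fits the simplest log-linear model, independence in a two-way table, where the fitted counts have the closed form (row total × column total) / n. For this particular model the maximum likelihood estimate exists with strictly positive fitted counts exactly when every row and column total is positive. The function name `independence_fit` and the sample tables are hypothetical. For general hierarchical log-linear models, positive margins are necessary but not sufficient, which is part of what makes the sparse case difficult.

```python
def independence_fit(table):
    """Fit the independence log-linear model to a two-way table of counts.

    Returns (fitted, mle_exists). Fitted counts are r_i * c_j / n; the MLE
    exists (all fitted counts strictly positive) only if no margin is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        raise ValueError("table must contain at least one observation")
    fitted = [[r * c / n for c in col_totals] for r in row_totals]
    mle_exists = all(r > 0 for r in row_totals) and all(c > 0 for c in col_totals)
    return fitted, mle_exists


# Two zero cells, but every margin is positive: the MLE exists.
fitted_ok, exists_ok = independence_fit([[2, 0], [0, 1]])

# A whole row and a whole column are empty: the MLE does not exist.
fitted_bad, exists_bad = independence_fit([[3, 0], [0, 0]])

print(exists_ok, exists_bad)
```

The first table shows that zero cells alone do not rule out estimation; the second shows how a pattern of zeros that empties a margin does.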

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 0631589
Program Officer: Gabor J. Szekely
Project Start:
Project End:
Budget Start: 2006-09-15
Budget End: 2011-08-31
Support Year:
Fiscal Year: 2006
Total Cost: $300,022
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213