This project seeks to increase the availability of detailed research data about a person's neighborhood and individual characteristics, behaviors, and health outcomes. Such information is crucial for research on critical national issues, such as health disparities. However, a delicate balance must be struck between providing easy access to these data and protecting the anonymity of study participants. Responding to the rising demand for contextualized microdata, large national surveys typically collect meticulous information about their subjects' personal and geographic attributes. When data are prepared for public-use files, however, much of this important detail is either suppressed or coarsened to protect the anonymity of respondents. These limitations reduce opportunities for important scientific research and impose costly burdens on producers and distributors, who must implement restrictive data use agreements. Little is known about how the risk of revealing a respondent's identity (i.e., disclosure risk) is affected by releasing microdata files that contain the contextual attributes of counties, tracts, block groups, and 1/2-mile geographic areas surrounding each subject. Considering factors that are determined at the outset of a study, it is not known how the disclosure risk of contextualized microdata is affected by varying levels of sensitive information, or by different sampling designs and analytical purposes. Turning to factors that are usually addressed after data collection, when research files are prepared for dissemination, it is not known to what extent disclosure risk and the scientific value of data are affected by the selection of different variables for release or by the application of various statistical techniques to limit disclosure. With a priori knowledge of these determinants, data producers will be able to anticipate how many and which respondents are at risk of disclosure, and adapt their data collection methods to protect them. 
Such adjustments will preserve and enhance the utility of the data for broad dissemination. Factors that affect the efficiency of data collection can then also be measured, allowing for the estimation of survey costs associated with modifying sampling designs to meet disclosure goals. Hence, this project seeks to incorporate disclosure risk into the conceptual and empirical frameworks used in the evaluation of survey designs. To do so, we first develop and validate models that predict the composition of survey data under different sampling designs. Next, we develop the measures and methods for assessing disclosure risk, analytical utility, and survey costs that are best suited for evaluating sampling and database designs. Lastly, we conduct simulations to gather estimates of risk, utility, and cost for studies with a wide range of sampling and database design characteristics.
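To make the idea of a disclosure risk measure concrete, the following is a minimal sketch of one simple proxy: the share of records that are unique in the sample on a set of identifying variables, computed before and after coarsening. It is not the measure developed in this project. The function names, the variable names (`age`, `sex`, `tract`), and the toy respondents are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def share_sample_unique(records, keys):
    """Fraction of records whose combination of `keys` values occurs exactly once."""
    if not records:
        return 0.0
    combos = [tuple(r[k] for k in keys) for r in records]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(records)


def coarsen_age(record, width=10):
    """Return a copy of the record with age replaced by the lower bound of its band."""
    out = dict(record)
    out["age"] = (record["age"] // width) * width
    return out


# Hypothetical toy respondents (illustrative only, not real survey data).
respondents = [
    {"age": 34, "sex": "F", "tract": "A"},
    {"age": 36, "sex": "F", "tract": "A"},
    {"age": 52, "sex": "M", "tract": "A"},
    {"age": 58, "sex": "M", "tract": "B"},
    {"age": 71, "sex": "F", "tract": "B"},
]
keys = ["age", "sex", "tract"]

# Share of sample-unique records with exact ages versus 10-year age bands.
risk_detailed = share_sample_unique(respondents, keys)
risk_coarsened = share_sample_unique([coarsen_age(r) for r in respondents], keys)
print(risk_detailed, risk_coarsened)
```

In this toy file every respondent is unique on exact age, sex, and tract, while banding age into decades leaves three of the five unique. Sample uniqueness is only a rough proxy: it ignores population counts, sampling weights, and the contextual attributes that this project is concerned with.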
Our project will increase the value and availability of scientific data by developing ways to assess, at the earliest stages of research, the risks of disclosing confidential information about study subjects. Detailed data about people's neighborhoods, characteristics, behaviors, and health are essential for informing policy and advancing science. But a balance must be struck between providing easy access to such data and protecting confidential information. By evaluating such disclosure risks in the design phase of research, we will enhance investments in data collection and increase the value and availability of data on detailed subpopulations and their environments.