Census Research Data Centers (RDCs), based in Ann Arbor, Berkeley, Boston, Chicago, Durham, Ithaca, Los Angeles, New York City, and Washington, provide approved scientists with access to confidential Census data for research that directly benefits both the Census Bureau and society. The RDC directors, administrators, board members, and researchers, together with the Center for Economic Studies (CES) and the Longitudinal Employer-Household Dynamics (LEHD) Program, constitute a collaborative research network. This network is building and supporting a secure distributed computer network that enables research critical to our economic and civic prosperity and security. The network operates under physical security constraints dictated by the Census Bureau and the Internal Revenue Service. These constraints essentially eliminate the possibility of distributing computations to facilities outside the Bureau's main computing facility. Instead, researchers use the RDCs as supervised remote access facilities that provide a secure, encrypted connection to the RDC computing network.
This project addresses the technical and logistical issues raised by the creation, maintenance, and growth of the RDC network while maintaining the confidentiality guaranteed to the respondents who supply Census data. The RDCs and LEHD will lead a new wave of research by developing innovative, large-scale linked data products that integrate Census Bureau surveys, censuses, and administrative records with data from state governments and surveys conducted by private institutions. Both CES and LEHD have extensive experience in creating these products. The RDC network researchers will build on that experience and contribute their own expertise to the data linking research. The newly created data will be richer than any presently available to researchers, with no increase in respondent burden. These data will also raise complicated and vexing issues regarding disclosure avoidance and participant privacy.
The project also creates synthetic versions of these confidential data sets. This will increase the accessibility of these data to social science researchers while preserving the confidentiality of private information. Full and partial data synthesis are new confidentiality protection techniques that rely on computationally intensive sampling from the posterior predictive distribution of the underlying confidential data. The result is micro-data that preserve important analytical properties of the original data and are, thus, inference-valid. The synthetic versions of confidential data are for public use. At the same time, ongoing research within the RDCs using the gold-standard confidential data will constantly test the quality of the synthesized data and allow for continuous improvement. As a result, a continuous feedback relationship will be established between the research activities conducted in RDCs on confidential Census Bureau data and the quality of the Bureau's public use data products, namely the synthetic micro-data created by these projects. To accomplish these computationally intensive activities, and to allow researchers to engage in such innovative research as agent-based simulations and geo-spatial analysis, we will install a supercluster of SMP nodes optimized for creating linked data, analyzing the gold-standard data, and processing the data to produce multiply-synthesized public use data sets. Two industry partners, Intel and Unisys, have promised to directly support the creation of this supercluster by donating 256 Itanium 2 processors and providing the computing crossbars, cluster infrastructure, and disk storage arrays at manufacturer's cost. The Linux-based system will be integrated and tuned by the proposal team from Argonne National Laboratory. The synthetic data specialists on the proposal team will port existing multi-threaded data synthesizers and develop new ones.
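The idea of sampling from a posterior predictive distribution to produce multiply-synthesized data can be illustrated with a deliberately small sketch. It is not the project's synthesizer, which this abstract does not describe. It assumes a single numeric variable, a normal model with a noninformative prior, and a hypothetical function name, `synthesize`; real synthesizers model many linked variables jointly and are far more computationally demanding.

```python
import random
import statistics


def synthesize(confidential, n_copies=5, seed=0):
    """Return n_copies synthetic versions of one numeric variable.

    Assumed model (illustrative only): values are Normal(mu, sigma^2)
    with the noninformative prior p(mu, sigma^2) proportional to 1/sigma^2.
    """
    rng = random.Random(seed)
    n = len(confidential)
    mean = statistics.fmean(confidential)
    var = statistics.variance(confidential)  # sample variance (n - 1 denominator)

    copies = []
    for _ in range(n_copies):
        # Draw sigma^2 from its posterior: (n - 1) * s^2 / chi-square(n - 1).
        # A chi-square with k degrees of freedom is Gamma(shape=k/2, scale=2).
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * var / chi2
        # Draw mu from its conditional posterior: Normal(mean, sigma^2 / n).
        mu = rng.gauss(mean, (sigma2 / n) ** 0.5)
        # Draw a full synthetic data set from the model at (mu, sigma^2).
        copies.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)])
    return copies


if __name__ == "__main__":
    # Hypothetical stand-in for a confidential variable; not real data.
    source = random.Random(42)
    confidential = [source.gauss(50.0, 10.0) for _ in range(1000)]
    for i, copy in enumerate(synthesize(confidential), start=1):
        print(i, round(statistics.fmean(copy), 2), round(statistics.stdev(copy), 2))
```

Each synthetic copy uses fresh parameter draws, so the variation across copies reflects uncertainty about the model parameters. That variation is what allows analysts to combine results across copies into valid inferences, and comparing those results with the same analysis on the confidential data is one way the feedback loop described above could be carried out.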
Broader Impacts: The research conducted in RDCs and at LEHD over the past decade has made important contributions to our understanding of essential social, economic, and environmental issues that would not have been possible without the use of the confidential data accessible via the RDC network. It is difficult to overstate the significance of this research, which has used more than 30 years of longitudinally integrated establishment micro-data from the Census Business Register and Economic Censuses; confidential micro-data from all the major Census surveys (Current Population Survey, Survey of Income and Program Participation, American Housing Survey); confidential micro-data from the Decennial Censuses of Population in 1990 and 2000; longitudinally integrated Unemployment Insurance wage records, ES-202 establishment data, and Social Security Administration data; federal tax information linked to major surveys; environmental data on air quality linked to Business Register and Economic Census data; Medicaid data linked to the Survey of Income and Program Participation; and many others.