This is the first component of a three-part project to create a massive microdata resource comprising the entire population of the United States in 1940. The database will provide the earliest information available on educational attainment, migration status, labor force status, wage and salary income, hours worked per week, and weeks worked in the previous year. Accordingly, it will provide the baseline for critical analyses of social and economic change. Researchers will be able to link recent health surveys, administrative records, and the National Death Index to the 1940 database, allowing study of the impact of early life conditions (including socioeconomic status, parental education, and family structure) on later health and mortality. The database will cover the entire population with full geographic detail, providing contextual information on childhood neighborhood characteristics, labor-market conditions, and environmental conditions. The new database will make a permanent and substantial addition to the nation's statistical infrastructure, and will have far-reaching implications for research across the social and behavioral sciences.
The project involves (1) transcription of 7.8 billion keystrokes of data describing the demographic and economic characteristics of all individuals, families, households, and group quarters present in the United States in 1940; (2) evaluation of data quality through random blind verification and comparison with published census returns; (3) development of data dictionaries to convert approximately 2.6 million different open-ended census responses into numeric classifications compatible with previous and subsequent census data; (4) data cleaning, including editing and imputation of inconsistent and missing data values; (5) development of documentation, including full descriptions of data processing methods, detailed analysis of comparability issues, and comprehensive machine-processable metadata; (6) incorporation of the database into the Integrated Public Use Microdata Series (IPUMS) data access system for free dissemination to the scientific community; and (7) implementation of secure data protection and preservation policies. The proposed work will be carried out by a team of highly skilled researchers with unparalleled expertise and experience in large-scale data creation, integration, and dissemination. The project is a collaboration of the Minnesota Population Center with the nation's largest producers of genealogical data and the National Archives and Records Administration. This collaboration allows a highly cost-effective use of scarce resources for shared infrastructure for population and health research.
This project provides fundamental infrastructure for health and population research, education, and policy-making. It will allow study of the impact of early life conditions (including socioeconomic status, parental education, and family structure) on later health and mortality. It will enable new kinds of spatial analysis, providing contextual information on childhood neighborhood characteristics, labor-market conditions, and environmental hazards. The proposed work is directly relevant to the central mission of the NIH as the steward of medical and behavioral research for the nation: the new data will advance fundamental knowledge about population health and population dynamics. The new database will make a permanent and substantial addition to the nation's statistical infrastructure, and will have far-reaching implications for research across health and population sciences.
Ruggles, Steven (2014) Big microdata for population research. Demography 51:287-97
Sobek, Matthew; Cleveland, Lara; Flood, Sarah et al. (2011) Big Data: Large-Scale Historical Infrastructure from the Minnesota Population Center. Hist Methods 44:61-68