This application seeks funding to create a complete set of microdata describing socioeconomic characteristics of the U.S. population in 1940. The project will digitize critical information on income, education, housing, and employment, greatly increasing the usefulness of the 1940 census for answering fundamental scientific questions about health and demographic change. The 1940 census was the first to collect information on years of schooling completed, wage and salary income, hours worked last week, and weeks worked last year. Data on parental income and education are essential for assessing childhood socioeconomic status. Accordingly, these indicators will be invaluable for assessing the effect of early life conditions on health outcomes. Because the database will cover the entire population with full geographic detail, it will provide contextual information on childhood neighborhood characteristics, including labor-market conditions. More broadly, because these data offer the earliest information available on key social and economic characteristics, they will provide an important baseline for studies of demographic and economic change. The socioeconomic variables will make a permanent and substantial addition to the nation's statistical infrastructure and will have far-reaching implications for research across the social and behavioral sciences.
The project involves (1) transcription of over one billion keystrokes of data describing socioeconomic characteristics of all individuals present in the United States in 1940; (2) evaluation of data quality through random blind verification and comparison with published census returns; (3) data cleaning, including editing and imputation of inconsistent and missing data values; (4) development of a data dictionary to convert approximately 80,000 different open-ended descriptions of institutions into numeric classifications compatible with previous and subsequent census data; (5) development of documentation, including full descriptions of data processing methods, detailed analysis of comparability issues, and comprehensive machine-processable metadata; (6) incorporation of the additional variables into the Integrated Public Use Microdata Series (IPUMS) data access system for free dissemination to the scientific community; and (7) implementation of secure data protection and preservation policies. The project will be executed by a team of highly experienced researchers with exceptional proficiency in large-scale data creation, integration, and dissemination. The project is a collaboration of the Minnesota Population Center with the nation's largest producer of genealogical data, the Census Bureau, and the National Archives and Records Administration. This collaboration allows a cost-effective use of scarce resources to develop shared infrastructure for population and health research.

Public Health Relevance

This project will provide basic infrastructure for health and population research, education, and policy-making. It will allow study of the impact of early life conditions, including parental income and education, on later health and mortality. It will enable new kinds of spatial analysis, providing contextual information on childhood neighborhood characteristics, including labor-market conditions. The proposed work is directly relevant to the central mission of the NIH as the steward of medical and behavioral research for the nation: the new data will advance fundamental knowledge about population health and population dynamics.

National Institutes of Health (NIH)
Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Clark, Rebecca L
University of Minnesota Twin Cities
Schools of Arts and Sciences
United States
Ruggles, Steven (2014) Big microdata for population research. Demography 51:287-97