This proposal seeks funding to expand the Integrated Public Use Microdata Series (IPUMS) by adding demographic and geographic data describing the entire enumerated population of the U.S. from 1790 to 1930. The project will provide data on the characteristics of over 600 million persons, quadrupling the quantity of U.S. census microdata available for scientific research. The data will cover entire populations with full geographic detail, providing contextual information on neighborhood characteristics, including ethnic composition, demographic behavior, and population mobility. These data offer the earliest information available on key social and economic characteristics, and they will provide invaluable insight into processes of long-run demographic and economic change. The data will make a permanent and substantial addition to the nation's statistical infrastructure and will have far-reaching implications for research across the social and behavioral sciences. The project is made possible by the donation of a massive, high-quality, verified transcription of information in the U.S. censuses, prepared by two major genealogical organizations.
Converting this immense body of raw data into a format suitable for scientific analysis will require the following tasks: (1) classify and code geographic locations to be compatible with categories used in the published census returns; (2) assess completeness and accuracy of the data transcription; (3) convert alphabetic string data into numeric categories that are comparable over time; (4) employ new data cleaning software to identify and correct common enumeration and transcription errors; (5) develop documentation, including full descriptions of data processing methods, detailed analysis of comparability issues, and comprehensive machine-processable metadata; (6) incorporate the data into the IPUMS data access system for free dissemination to the scientific community; and (7) implement secure data protection and preservation policies. The project will be executed by a team of highly experienced researchers with exceptional proficiency in large-scale data creation, integration, and dissemination and will leverage cutting-edge technology to process an unprecedented volume of data at reasonable cost. The project is a collaboration of the Minnesota Population Center with the world's largest producers of genealogical data, allowing cost-effective use of scarce resources to develop shared infrastructure for population and health research.

Public Health Relevance

This project will provide basic infrastructure for health and population research, education, and policy-making. It will allow research on fertility, mortality, family composition, life-course transitions, mobility, and the impact of neighborhood conditions on demographic behavior. The proposed work is directly relevant to the central mission of the NIH as the steward of medical and behavioral research for the nation: the new data will advance fundamental knowledge about population health and population dynamics and will spawn new methods of spatiotemporal analysis that can deepen understanding of the ongoing transformations of American society.

National Institutes of Health (NIH)
Eunice Kennedy Shriver National Institute of Child Health & Human Development (NICHD)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-PSE-B (90)S)
Program Officer
Bures, Regina M
University of Minnesota Twin Cities
Organized Research Units
United States
Kugler, Tracy A; Fitch, Catherine A (2018) Interoperable and accessible census and survey data from IPUMS. Sci Data 5:180007
Roberts, Evan; Warren, John Robert (2017) Family structure and childhood anthropometry in Saint Paul, Minnesota in 1918. Hist Fam 22:258-290
Ruggles, Steven (2014) Big microdata for population research. Demography 51:287-97