Science and Engineering (S&E) research generates substantial returns in human knowledge and in social and economic benefits. Nations around the globe compete for scientific and technological leadership through substantial research funding and focused efforts to develop highly trained workforces. To date, efforts to measure and understand national and international trends in S&E, and to assess global strengths and weaknesses, have largely relied on the analysis of documents such as patents and publications drawn from large and growing datasets. But this approach too often misses or misidentifies the people and teams who do productive science and engineering work. Robust indicators of the size, composition, collaboration, and mobility of the S&E workforce within and across nations are largely missing from analysis and reporting, because these key aspects of the national and international scientific enterprise are poorly captured by data analysis focused on documents and citations. To address this problem, this project develops person-level workforce and collaboration measures that could add granularity to comparisons of international S&E competitiveness and lead to new policy insights for S&E workforce training, hiring, and retention for a nation's future.

Such person-level indicators require that individual researchers who appear in multiple bibliographic datasets be correctly identified and linked. Identifying and linking authors by name is daunting because names are often ambiguous. Ambiguity is particularly acute for Asian names, a significant problem because Asian researchers play an increasingly important role in many fields of research. This project addresses the challenge of systematically and routinely disambiguating names in big bibliographic datasets using a new Automated and Stratified Entity Disambiguation framework. Core datasets for this effort are derived using a new method that relies on multiple data fields and an iterative process to automatically create disambiguated datasets, which can then be used to train artificial intelligence tools for robust person-level analysis. To improve disambiguation accuracy, name instances are stratified into two groups according to name-ethnicity and disambiguated separately, with a model for each group learned from the automatically generated truth data. Based on the disambiguated data, this project develops new person-level S&E indicators that characterize the landscape and trends of the international S&E research workforce across all science and engineering fields. The new big data tools for automatic disambiguation at scale will be documented and released publicly to enable expansion, validation, and reuse by the science community as well as science of science policy researchers.
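
The stratify-then-disambiguate idea described above can be illustrated with a minimal Python sketch. It is not the project's method: the abstract describes models trained on automatically generated truth data, whereas this sketch substitutes a simple coauthor-overlap rule so that it stays self-contained. The record fields (`id`, `name`, `coauthors`, `stratum`), the stratum labels `"A"` and `"B"`, the blocking key, and the per-stratum thresholds are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def block_key(name):
    """Blocking key: lowercase surname plus the first initial of the given name.

    Assumes names are written as "Surname, Given" (an illustrative convention).
    """
    surname, given = [part.strip().lower() for part in name.split(",", 1)]
    return f"{surname}_{given[:1]}"


def stratify(records, stratum_of):
    """Split name instances into strata using a caller-supplied labeling function."""
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    return dict(strata)


def disambiguate(records, min_shared):
    """Greedy clustering within each block.

    A record joins the first existing cluster in its block with which it
    shares at least `min_shared` coauthors; otherwise it starts a new cluster.
    This rule is a stand-in for a learned model.
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec["name"])].append(rec)

    clusters = []
    for block in blocks.values():
        block_clusters = []
        for rec in block:
            coauthors = set(rec["coauthors"])
            for cluster in block_clusters:
                if len(cluster["coauthors"] & coauthors) >= min_shared:
                    cluster["ids"].append(rec["id"])
                    cluster["coauthors"] |= coauthors
                    break
            else:
                block_clusters.append({"ids": [rec["id"]], "coauthors": coauthors})
        clusters.extend(c["ids"] for c in block_clusters)
    return clusters


# Hypothetical name instances; the stratum labels are placeholders.
records = [
    {"id": 1, "name": "Wang, Wei", "coauthors": ["Li, Na", "Chen, Yu"], "stratum": "A"},
    {"id": 2, "name": "Wang, W.", "coauthors": ["Li, Na", "Chen, Yu", "Smith, J"], "stratum": "A"},
    {"id": 3, "name": "Wang, Wen", "coauthors": ["Zhao, Lei"], "stratum": "A"},
    {"id": 4, "name": "Miller, Anna", "coauthors": ["Jones, Bob"], "stratum": "B"},
    {"id": 5, "name": "Miller, A.", "coauthors": ["Jones, Bob"], "stratum": "B"},
]

# Assumed per-stratum settings: a stricter threshold where names collide more often.
thresholds = {"A": 2, "B": 1}

result = {
    stratum: disambiguate(recs, thresholds[stratum])
    for stratum, recs in stratify(records, lambda r: r["stratum"]).items()
}
print(result)
```

The point of the sketch is the design choice rather than the rule itself: each stratum gets its own decision setting, so a group in which many distinct researchers share a surname and initial can demand stronger evidence before two name instances are merged.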

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Regents of the University of Michigan - Ann Arbor
Ann Arbor
United States