The current inability to identify which papers bearing the same author name (last name, first initial) are written by different individuals is an impediment to user retrieval of health-related information as well as research devoted to understanding the publication and collaboration behavior of biomedical scientists. Disambiguation of author names will help in scientometrics and health policy studies, as well as everyday scientific tasks of numerous kinds: for example, choosing referees and conference attendees. We have created a probabilistic model of how the attributes of Medline articles vary across authors, and hypothesize that this can serve as the basis for disambiguating author names in Medline. In this exploratory two-year study, it is proposed:

1. To create and evaluate a database of "author-individuals" that lists all of the papers in Medline and assigns the great majority of them to one or more specific author-individuals with high confidence. A probabilistic model based on Medline record fields will be refined which estimates, for any two papers bearing the same name, the probability that they were written by the same individual, including supplementary information such as author first names and affiliations for all authors. Then, clustering algorithms will be optimized and applied to form author-individual clusters for all names in Medline.

2. To update the author-individual database (weekly) and underlying probabilistic model (yearly), and to create and evaluate a free, public, multi-user query interface. The database will also be made available to academic researchers for bibliometric, scientometric and policy studies.

This research will set the stage for more in-depth studies of publication and collaboration behavior in the future that should give valuable insights into ways to increase scientific productivity in biomedical sciences.
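
The two-stage approach described in the abstract (a pairwise same-author probability, followed by clustering) can be illustrated with a minimal sketch. This is not the investigators' model: the feature set, the weights in `WEIGHTS`, the `BIAS` term, the logistic form, the 0.5 threshold, and the use of single-link clustering are all assumptions chosen for illustration. The record fields (`pmid`, `coauthors`, `affiliation`, `journal`, `first_name`) are hypothetical stand-ins for Medline record fields.

```python
from itertools import combinations
from math import exp

# Hypothetical log-odds weights for shared attributes (illustrative only,
# not estimated from Medline and not taken from the study).
WEIGHTS = {"coauthor": 2.5, "affiliation": 1.5, "journal": 1.0, "first_name": 2.0}
BIAS = -3.0  # assumed prior log-odds that two same-name papers share an author


def same_author_probability(a, b):
    """Estimate P(same individual) for two papers bearing the same name."""
    score = BIAS
    if set(a["coauthors"]) & set(b["coauthors"]):
        score += WEIGHTS["coauthor"]
    if a["affiliation"] and a["affiliation"] == b["affiliation"]:
        score += WEIGHTS["affiliation"]
    if a["journal"] == b["journal"]:
        score += WEIGHTS["journal"]
    if a["first_name"] and a["first_name"] == b["first_name"]:
        score += WEIGHTS["first_name"]
    return 1.0 / (1.0 + exp(-score))  # logistic link


def cluster_papers(papers, threshold=0.5):
    """Single-link clustering: merge papers whose pairwise probability
    meets the threshold, using a union-find structure."""
    parent = list(range(len(papers)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(papers)), 2):
        if same_author_probability(papers[i], papers[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, paper in enumerate(papers):
        clusters.setdefault(find(i), []).append(paper["pmid"])
    return sorted(sorted(c) for c in clusters.values())
```

Each resulting cluster stands for one putative author-individual. Single-link merging is the simplest choice for a sketch but can chain unrelated papers together through one spurious match, which is one reason the abstract's plan to optimize the clustering step matters in practice.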

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Exploratory/Developmental Grants (R21)
Project #
5R21LM008364-02
Application #
7008844
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Sim, Hua-Chuan
Project Start
2005-01-15
Project End
2008-07-31
Budget Start
2006-01-15
Budget End
2008-07-31
Support Year
2
Fiscal Year
2006
Total Cost
$172,502
Indirect Cost
Name
University of Illinois at Chicago
Department
Psychiatry
Type
Schools of Medicine
DUNS #
098987217
City
Chicago
State
IL
Country
United States
Zip Code
60612
Smalheiser, Neil R; Torvik, Vetle I; Zhou, Wei (2009) Arrowsmith two-node search interface: a tutorial on finding meaningful links between two disparate sets of articles in MEDLINE. Comput Methods Programs Biomed 94:190-7
Torvik, Vetle I; Smalheiser, Neil R (2009) Author Name Disambiguation in MEDLINE. ACM Trans Knowl Discov Data 3:
Torvik, Vetle I; Smalheiser, Neil R (2007) A quantitative model for linking two disparate sets of articles in MEDLINE. Bioinformatics 23:1658-65
Zhou, Wei; Torvik, Vetle I; Smalheiser, Neil R (2006) ADAM: another database of abbreviations in MEDLINE. Bioinformatics 22:2813-8