A supervised machine learning approach to author disambiguation in the Web of Science

Andreas Rehs

Journal of Informetrics, 2021, vol. 15, issue 3

Abstract: Author-level scientometric indicators are an important tool in individual and institutional research assessment, and they require high-quality author-publication profiles. To address this need, our study developed a robust supervised machine learning approach, combined with graph community detection methods, to disambiguate author names in the Web of Science publication database. We used the unique author identifier Researcher ID to retrieve true authorship data for 1,904 scientists and trained a random forest and a logistic regression classifier on 1.2 million corresponding publication pairs whose authors share the same last name and first-name initial. To do this, we reviewed a vast set of paper and author characteristics and randomly introduced missing data to make our models robust to quality changes in new publication data. On an unseen test set, we achieved F1 scores of 0.82 with the random forest and 0.75 with the logistic regression model. Subsequently, we evaluated feature performance and applied the Infomap graph community detection algorithm to identify all publications belonging to an author. The community detection yielded reasonable cluster metrics (mean K-metric = 0.78 for the logistic regression-based model and 0.81 for the random forest-based model). Finally, we tested our algorithm on a large surname-initial block (“Muller, M.”) and demonstrated its speed and predictive performance.
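
The pipeline described in the abstract (surname-initial blocking, pairwise classification, clustering, cluster evaluation) can be illustrated with a minimal Python sketch. It is not the paper's implementation. All function names and the toy records are hypothetical. The classifier step is omitted, so the "same author" pair labels are assumed to be given. Connected components stand in for the Infomap algorithm. The K-metric is computed with the commonly used definition K = sqrt(ACP × AAP) (average cluster purity times average author purity), which the abstract itself does not spell out.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def block_key(name):
    """Map 'Muller, Martin' to the block key ('muller', 'm')."""
    last, _, first = name.partition(",")
    return last.strip().lower(), first.strip()[:1].lower()


def candidate_pairs(records):
    """Yield pairs of publication ids whose author names share a block."""
    blocks = defaultdict(list)
    for pub_id, name in records:
        blocks[block_key(name)].append(pub_id)
    for pub_ids in blocks.values():
        yield from combinations(sorted(pub_ids), 2)


def cluster(pub_ids, same_author_pairs):
    """Group publications via connected components (union-find).

    Assumption: this is a simple stand-in for Infomap. Every pair that a
    classifier labelled 'same author' is an edge; each component is one author.
    """
    parent = {p: p for p in pub_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_author_pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for p in pub_ids:
        groups[find(p)].add(p)
    return list(groups.values())


def k_metric(predicted, truth):
    """K = sqrt(ACP * AAP); `truth` maps publication id -> true author."""
    n = sum(len(c) for c in predicted)
    overlap = defaultdict(int)      # (cluster index, author) -> shared pubs
    author_size = defaultdict(int)  # author -> number of pubs
    for i, c in enumerate(predicted):
        for p in c:
            overlap[(i, truth[p])] += 1
            author_size[truth[p]] += 1
    acp = sum(v * v / len(predicted[i]) for (i, _), v in overlap.items()) / n
    aap = sum(v * v / author_size[a] for (_, a), v in overlap.items()) / n
    return sqrt(acp * aap)


# Toy data (hypothetical): three papers fall into the "Muller, M." block.
records = [("p1", "Muller, Martin"), ("p2", "Muller, M."),
           ("p3", "Muller, Maria"), ("p4", "Meyer, A.")]
pairs = list(candidate_pairs(records))
clusters = cluster(["p1", "p2", "p3"], [("p1", "p2")])
```

In a full pipeline, each candidate pair would be turned into a feature vector and scored by the trained classifier before clustering. Here the single positive pair `("p1", "p2")` is simply assumed.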

Keywords: Author name disambiguation; Machine learning; Pairwise classification; Random forest; Community detection; Web of Science
Date: 2021

Downloads: (external link)
http://www.sciencedirect.com/science/article/pii/S1751157721000377
Full text for ScienceDirect subscribers only

Persistent link: https://EconPapers.repec.org/RePEc:eee:infome:v:15:y:2021:i:3:s1751157721000377

DOI: 10.1016/j.joi.2021.101166

Journal of Informetrics is currently edited by Leo Egghe

More articles in Journal of Informetrics from Elsevier
Bibliographic data for series maintained by Catherine Liu.

 
Page updated 2025-03-19
Handle: RePEc:eee:infome:v:15:y:2021:i:3:s1751157721000377