Multilayer heuristics based clustering framework (MHCF) for author name disambiguation

Waqas, Humaira; Qadir, Muhammad Abdul

Humaira Waqas and Muhammad Abdul Qadir
Humaira Waqas: Capital University of Science & Technology
Muhammad Abdul Qadir: Capital University of Science & Technology

Scientometrics, 2021, vol. 126, issue 9, No 15, 7637-7678

Abstract: Author name ambiguity is a nontrivial problem for digital libraries and scholarly data search engines, and it affects the reliability of the authorship data they provide. Most existing solutions are complex, inflexible and feature dependent, focus on specific scenarios, rely on keyword-based similarities, and are ineffective at disambiguating authors who have fewer citations than others (with more publications) sharing the same name. This calls for a flexible name disambiguation framework that is simple, generic and context aware, and that can effectively disambiguate authors who share a name but have varying numbers of citations. In this paper we propose a multi-layer heuristics-based clustering framework. Global and structure-aware features are used to group publications together using our proposed Research2vec model. Unlike many heuristics-based multilayer approaches, our framework uses features with better discriminating power, following our proposed feature rank, in an incremental fashion to minimize false positives after each merge. Also, unlike other similar approaches, our framework uses contextual information to group similar publications rather than matching identical keywords. We have carefully evaluated our framework on three different datasets against two word-embedding-based approaches, two heuristics-based approaches, two hybrid approaches and one graph-based approach. The results show that our framework performs better than all of them, i.e., MHCF-G (+5% pF1), MHCF-GL (+10% pF1), MDC (+12% pF1), HHC (+32% pF1), SAND-1 (+31%), SAND-2 (+22%) and GFAD (+18%). Our proposed solution is also evaluated on our newly proposed dataset ‘CustAND’, which covers more than 11 of the most discriminating features that are not available together in current AND datasets. The experimental results on the CustAND collection show that our framework achieves an overall pF1 of 93.3% with only three features, which further demonstrates its effectiveness.
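
The abstract reports its results as pF1 without defining it. Assuming pF1 denotes the pairwise F1 score commonly used to evaluate author name disambiguation, the following minimal sketch shows how such a score can be computed from a predicted clustering and a ground-truth labeling. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of publication index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(predicted, truth):
    """Pairwise precision, recall and F1 for two labelings of the same publications."""
    if len(predicted) != len(truth):
        raise ValueError("labelings must cover the same publications")
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    # Guard against empty pair sets (e.g. every publication in its own cluster).
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: five publications under one ambiguous name.
truth = ["a", "a", "a", "b", "b"]   # true author of each publication
predicted = [0, 0, 1, 1, 1]         # cluster assigned by some method
precision, recall, f1 = pairwise_f1(predicted, truth)
```

In this toy case two of the four predicted same-author pairs are correct and two of the four true pairs are recovered, so precision, recall and F1 are all 0.5.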

Keywords: Author name disambiguation (AND); Name ambiguity; Digital library (DL); Unsupervised machine learning approach; Clustering distinct authors
Date: 2021

Downloads: (external link)
http://link.springer.com/10.1007/s11192-021-04087-7 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:126:y:2021:i:9:d:10.1007_s11192-021-04087-7

Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11192

DOI: 10.1007/s11192-021-04087-7

Scientometrics is currently edited by Wolfgang Glänzel

More articles in Scientometrics from Springer, Akadémiai Kiadó
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.