Disambiguation of author entities in ADS using supervised learning and graph theory methods
Helena Mihaljević () and
Lucía Santamaría ()
Additional contact information
Helena Mihaljević: Hochschule für Technik und Wirtschaft
Lucía Santamaría: Amazon Development Center
Scientometrics, 2021, vol. 126, issue 5, No 9, 3893-3917
Abstract:
Abstract Disambiguation of authors in digital libraries is essential for many tasks, including efficient bibliographical searches and scientometric analyses to the level of individuals. The question of how to link documents written by the same person has been given much attention by academic publishers and information retrieval researchers alike. Usual approaches rely on publications’ metadata such as affiliations, email addresses, co-authors, or scholarly topics. Lack of homogeneity in the structure of bibliographic collections and discipline-specific dissimilarities between them make the creation of general-purpose disambiguators arduous. We present an algorithm to disambiguate authorships in the Astrophysics Data System (ADS) following an established semi-supervised approach of training a classifier on authorship pairs and clustering the resulting graphs. Due to the lack of high-signal features such as email addresses and citations, we engineer additional content- and location-based features via text embeddings and named-entity recognition. We train various nonlinear tree-based classifiers and detect communities from the resulting weighted graphs through label propagation, a fast yet efficient algorithm that requires no tuning. The resulting procedure reaches reasonable complexity and offers possibilities for interpretation. We apply our method to the creation of author entities in a recent ADS snapshot. The algorithm is evaluated on 39 manually-labeled author blocks comprising 9545 authorships from 562 author profiles. Our best approach utilizes the Random Forest classifier and yields a micro- and macro-averaged BCubed $$\mathrm {F}_1$$ F 1 score of 0.95 and 0.87, respectively. We release our code and labeled data publicly to foster the development of further disambiguation procedures for ADS.
Keywords: Author name disambiguation; Record linkage; Supervised learning; Label Propagation; Information retrieval; Digital libraries (search for similar items in EconPapers)
Date: 2021
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (2)
Downloads: (external link)
http://link.springer.com/10.1007/s11192-021-03951-w Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:126:y:2021:i:5:d:10.1007_s11192-021-03951-w
Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11192
DOI: 10.1007/s11192-021-03951-w
Access Statistics for this article
Scientometrics is currently edited by Wolfgang Glänzel
More articles in Scientometrics from Springer, Akadémiai Kiadó
Bibliographic data for series maintained by Sonal Shukla () and Springer Nature Abstracting and Indexing ().