Citation‐based bootstrapping for large‐scale author disambiguation

Michael Levin, Stefan Krawczyk, Steven Bethard and Dan Jurafsky

Journal of the American Society for Information Science and Technology, 2012, vol. 63, issue 5, 1030-1047

Abstract: We present a new, two‐stage, self‐supervised algorithm for author disambiguation in large bibliographic databases. In the first “bootstrap” stage, a collection of high‐precision features is used to bootstrap a training set with positive and negative examples of coreferring authors. A supervised feature‐based classifier is then trained on the bootstrap clusters and used to cluster the authors in a larger unlabeled dataset. Our self‐supervised approach shares the advantages of unsupervised approaches (no need for expensive hand labels) as well as supervised approaches (a rich set of features that can be discriminatively trained). The algorithm disambiguates 54,000,000 author instances in Thomson Reuters' Web of Knowledge with B³ F1 of .807. We analyze parameters and features, particularly those from citation networks, which have not been deeply investigated in author disambiguation. The most important citation feature is self‐citation, which can be approximated without expensive extraction of the full network. For the supervised stage, the minor improvement due to other citation features (increasing F1 from .748 to .767) suggests they may not be worth the trouble of extracting from databases that don't already have them. A lean feature set without expensive abstract and title features performs 130 times faster with about equal F1.
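
The F1 figures in the abstract are B-cubed scores, a standard way to compare a predicted clustering against a gold clustering. As a minimal sketch (not the authors' evaluation code), the function below computes B-cubed precision, recall, and F1 under the usual per-item definition: for each author instance, precision is the share of its predicted cluster that truly corefers with it, and recall is the share of its gold cluster that the predicted cluster recovers. The function name `b_cubed`, the dict-based input format, and the toy labels are assumptions made for illustration.

```python
from collections import Counter


def b_cubed(predicted, gold):
    """Return (precision, recall, f1) for the B-cubed clustering metric.

    predicted, gold: dicts mapping each item id to a cluster label.
    Both dicts must label exactly the same items.
    """
    if predicted.keys() != gold.keys():
        raise ValueError("predicted and gold must label the same items")
    if not gold:
        raise ValueError("no items to score")

    pred_sizes = Counter(predicted.values())   # size of each predicted cluster
    gold_sizes = Counter(gold.values())        # size of each gold cluster
    # Number of items shared by each (predicted cluster, gold cluster) pair.
    overlap = Counter((predicted[i], gold[i]) for i in gold)

    precision = recall = 0.0
    for i in gold:
        shared = overlap[(predicted[i], gold[i])]
        precision += shared / pred_sizes[predicted[i]]
        recall += shared / gold_sizes[gold[i]]

    n = len(gold)
    precision /= n
    recall /= n
    # Every item shares a cluster with itself, so precision + recall > 0.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): five author instances, two true authors.
gold = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B"}
predicted = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
p, r, f = b_cubed(predicted, gold)
print(round(p, 4), round(r, 4), round(f, 4))
```

Counting overlaps per cluster pair keeps the computation linear in the number of instances, which matters at the scale of tens of millions of author instances; a naive pairwise comparison would not.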

Date: 2012

Downloads: (external link)
https://doi.org/10.1002/asi.22621

Persistent link: https://EconPapers.repec.org/RePEc:bla:jamist:v:63:y:2012:i:5:p:1030-1047

Ordering information: This journal article can be ordered from
https://doi.org/10.1002/(ISSN)1532-2890

More articles in Journal of the American Society for Information Science and Technology from Association for Information Science & Technology
Bibliographic data for series maintained by Wiley Content Delivery.

 
Handle: RePEc:bla:jamist:v:63:y:2012:i:5:p:1030-1047