Large-scale name disambiguation of Chinese patent inventors (1985–2016)

Deyun Yin, Kazuyuki Motohashi and Jianwei Dang
Additional contact information
Deyun Yin: The University of Tokyo
Jianwei Dang: Tongji University

Scientometrics, 2020, vol. 122, issue 2, No 1, 765-790

Abstract: This study presents the first systematic disambiguation of Chinese patent inventors in the State Intellectual Property Office of China patent database from 1985 to 2016. Using a list of 66,248 inventors with rare names and hand-labeled data on 1465 inventors, our supervised learning algorithm identified 3.99 million unique inventors from 1.84 million Chinese names referring to 14.68 million patent-inventor records. We developed a method for constructing high-quality training data from a third-party rare name list and provided evidence for its reliability in settings where large-scale, representative hand-labeled data is crucial but expensive to obtain. To optimize clustering results on a large-scale dataset with a highly unbalanced distribution, we also modified robust single linkage by adding a constraint on the maximum distance within the generated clusters. Across different training and testing data and clustering parameters, our algorithm yielded F1 scores of up to 93.36% before clustering and 99.10% after clustering, with final splitting errors of 1.05–1.34% and lumping errors of 0.21–0.83%. We also applied this framework to standardizing applicants' names according to their text similarity and geographical information, based on high-resolution geocoding data for all addresses within mainland China.
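
The clustering idea mentioned in the abstract (single linkage with a cap on the maximum distance inside a cluster) can be illustrated with a small sketch. This is not the authors' implementation: it uses plain single linkage rather than robust single linkage, and the function name `constrained_single_linkage`, the parameters `link_threshold` and `max_diameter`, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def constrained_single_linkage(dist, n, link_threshold, max_diameter):
    """Cluster items 0..n-1 (illustrative sketch, hypothetical names).

    dist maps each pair (i, j) with i < j to a distance.
    Two clusters are linked when their closest pair is within
    link_threshold, unless the merged cluster would contain a pair
    farther apart than max_diameter.
    """
    clusters = {i: [i] for i in range(n)}  # cluster id -> member items
    label = list(range(n))                 # item -> cluster id

    def d(i, j):
        return dist[(i, j)] if i < j else dist[(j, i)]

    # Single linkage: visit candidate links from closest to farthest.
    for i, j in sorted(combinations(range(n), 2), key=lambda p: d(*p)):
        if d(i, j) > link_threshold:
            break
        a, b = label[i], label[j]
        if a == b:
            continue
        # Added constraint: both clusters already respect max_diameter,
        # so only the cross-cluster pairs need to be checked.
        if any(d(x, y) > max_diameter
               for x in clusters[a] for y in clusters[b]):
            continue
        for y in clusters[b]:
            label[y] = a
        clusters[a].extend(clusters.pop(b))
    return sorted(sorted(members) for members in clusters.values())


# Toy data: five records placed on a line, distance = absolute difference.
points = [0, 4, 8, 12, 50]
dist = {(i, j): abs(points[i] - points[j])
        for i, j in combinations(range(len(points)), 2)}

print(constrained_single_linkage(dist, len(points), 5, 10))
# prints [[0, 1, 2], [3], [4]]
```

With a loose cap (for example `max_diameter=100`) the same call chains records 0 to 3 into one cluster, which is the chaining behavior that a within-cluster distance constraint is meant to limit.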

Keywords: Disambiguation; Patent; Inventor; Machine learning; Gradient boosting decision tree; Single linkage
Date: 2020

Access to the full text of the articles in this series is restricted.

DOI: 10.1007/s11192-019-03310-w

Scientometrics is currently edited by Wolfgang Glänzel

More articles in Scientometrics from Springer, Akadémiai Kiadó
Bibliographic data for series maintained by Sonal Shukla.

Page updated 2020-04-24
Handle: RePEc:spr:scient:v:122:y:2020:i:2:d:10.1007_s11192-019-03310-w