Inventor Name Disambiguation with Gradient Boosting Decision Tree and Inventor Mobility in China (1985-2016)
Deyun Yin () and
Kazuyuki Motohashi ()
Discussion papers from Research Institute of Economy, Trade and Industry (RIETI)
This paper presents the first systematic disambiguation result of all Chinese patent inventors in the State Intellectual Property Office of China (SIPO) patent database from 1985 to 2016. We provide a method of constructing high-qualitative training data from lists of rare names and evidence for the reliability of these generated labels when large-scale and representative hand-labeled data are crucial but expensive, prone to error, and even impossible to obtain. We then compare the performances of seven supervised models, i.e., naive Bayes, logistic, linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA), as well as tree-based methods (random forest, AdaBoost, and gradient boosting decision trees), and found that gradient boosting classifier outperforms all other classifiers with the highest F1-score and stable performance in solving the homonym problem prevailing in Chinese names. In the last step, instead of adopting the more popular hierarchical clustering method, we clustered records with the density-based spatial clustering of applications with noise (DBSCAN) based on the distance matrix predicated by the GBDT classifier. Varying across different testing data and parameters of DBSCAN, our algorithm yielded a F1-score ranging from 93.5%-99.3% with splitting error within the range 0.5%-3% and lumping error between 0.056%-0.37%. Based on our disambiguated result, we provide an overview of Chinese inventors' regional mobility.
Pages: 56 pages
New Economics Papers: this item is included in nep-big, nep-cmp, nep-cna and nep-tra
References: View references in EconPapers View complete reference list from CitEc
Citations: Track citations by RSS feed
Downloads: (external link)
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
Persistent link: https://EconPapers.repec.org/RePEc:eti:dpaper:18018
Access Statistics for this paper
More papers in Discussion papers from Research Institute of Economy, Trade and Industry (RIETI) Contact information at EDIRC.
Bibliographic data for series maintained by TANIMOTO, Toko ().