Automatic noise reduction of domain-specific bibliographic datasets using positive-unlabeled learning
Guo Chen,
Jing Chen,
Yu Shao and
Lu Xiao
Additional contact information
Guo Chen: Nanjing University of Science and Technology
Jing Chen: Nanjing University of Science and Technology
Yu Shao: Northwest Engineering Corporation Limited
Lu Xiao: Nanjing University of Finance and Economics
Scientometrics, 2023, vol. 128, issue 2, No 13, 1187-1204
Abstract:
Constructing a bibliographic dataset is fundamental for domain analysis in bibliometric research. However, irrelevant documents (so-called “impurities”) in the initial domain dataset are inevitable and difficult to identify, and eliminating them requires considerable human effort. To solve this problem, we propose a weakly supervised noise reduction approach based on the Positive-Unlabeled Learning (PU-Learning) algorithm to clean the initial bibliographic dataset automatically. The basic idea is to use a batch of “absolutely positive sample sets” already available in the dataset to obtain a collection of “reliable negative sample sets,” from which a training set can be constructed for downstream supervised classification. We conducted a comparative experiment using Artificial Intelligence (AI) domain reports from the US National Technical Reports Library (NTIS) as an example. We compared schemes with different variables to explore the influence of various technical aspects on the final noise reduction performance. Our approach achieved significant improvements over the similarity-based unsupervised baseline: recall rose from 0.3742 to 0.8103, and precision rose from 0.6621 to 0.7383. We found that the choice of document representation algorithm has a crucial impact, whereas the classification strategy and the s_ratio setting in PU-Learning do not. Our approach requires no manually annotated data and can thus help bibliometric researchers construct high-quality bibliographic datasets.
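The two-step idea in the abstract (derive reliable negatives from known positives, then train a supervised classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words representation, the nearest-centroid classifier, the function names, and the `neg_fraction` parameter are all assumptions chosen for brevity, and `neg_fraction` is not the paper's s_ratio.

```python
from collections import Counter
import math


def vectorize(text):
    """Bag-of-words term counts; a stand-in for any document representation."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average the term weights of a list of term-count vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    n = len(vectors)
    return {term: count / n for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def reliable_negatives(positives, unlabeled, neg_fraction=0.5):
    """Step 1: take the unlabeled documents least similar to the positive
    centroid as reliable negatives. neg_fraction is a hypothetical knob."""
    pos_c = centroid([vectorize(d) for d in positives])
    ranked = sorted(unlabeled, key=lambda d: cosine(vectorize(d), pos_c))
    k = max(1, int(len(ranked) * neg_fraction))
    return ranked[:k]


def filter_dataset(positives, unlabeled, neg_fraction=0.5):
    """Step 2: fit a nearest-centroid classifier on positives versus reliable
    negatives, then keep the unlabeled documents it predicts as relevant."""
    negatives = reliable_negatives(positives, unlabeled, neg_fraction)
    pos_c = centroid([vectorize(d) for d in positives])
    neg_c = centroid([vectorize(d) for d in negatives])
    kept = []
    for doc in unlabeled:
        v = vectorize(doc)
        if cosine(v, pos_c) > cosine(v, neg_c):
            kept.append(doc)
    return kept


# Toy data, invented for illustration only.
positives = [
    "neural network learning algorithm",
    "machine learning expert system",
    "artificial intelligence knowledge reasoning",
]
unlabeled = [
    "deep neural network learning",
    "soil erosion river sediment",
    "expert system knowledge base",
    "bridge concrete corrosion steel",
    "river flood sediment transport",
    "machine reasoning algorithm",
]
kept = filter_dataset(positives, unlabeled)
```

On this toy input, `kept` holds the three AI-related titles and drops the three unrelated ones. A real pipeline would presumably swap in stronger document representations, which the abstract identifies as the factor that matters most.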
Keywords: Domain analysis; Bibliographic dataset; Noise reduction; PU-learning
Date: 2023
Downloads: (external link)
http://link.springer.com/10.1007/s11192-022-04598-x Abstract (text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:128:y:2023:i:2:d:10.1007_s11192-022-04598-x
Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11192
DOI: 10.1007/s11192-022-04598-x
Scientometrics is currently edited by Wolfgang Glänzel
More articles in Scientometrics from Springer, Akadémiai Kiadó