Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus

Mehmet Ali Abdulhayoglu and Bart Thijs
Additional contact information
Mehmet Ali Abdulhayoglu: KU Leuven
Bart Thijs: KU Leuven

Scientometrics, 2018, vol. 116, issue 2, No 30, 1229-1245

Abstract: A novel hashing algorithm is applied to match two prominent bibliographic databases at the paper level. Such matching has been studied and carried out many times in the literature, but relying only on journal-level information because of the massive volume of indexed publications. With a paper-level match, missing or erroneous items can be completed from the other source, and the overlap can be measured more reliably. In this context, we focus on measuring the overlap between Clarivate Analytics Web of Science (WoS) and Elsevier’s Scopus at the paper level. Our focus is on detecting exact matches, that is, no false positives are tolerated at all. To this end, we follow a twofold matching procedure. First, a locality sensitive hashing algorithm, which provides fast approximate nearest neighbours and similarities, is applied to obtain WoS-Scopus pair suggestions. Second, for each suggested pair, different heuristics are applied to identify those pairs of records that indeed refer to the same publication. We observe that at least 74% of WoS publications are also indexed by Scopus. The percentage increases to 92% when only the cited publications are retained. The overlap of WoS records is also presented by Institute for Scientific Information subject category (SC). Of those, three large SCs with relatively low overlap ratios are chosen and examined in detail. Last but not least, it takes only about an hour to match 14.2 million versus 19.6 million publications from the publication years 2004–2013 in a high performance computing environment.
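
The twofold procedure described in the abstract (LSH to suggest candidate pairs, then a stricter check on each suggestion) can be illustrated with a small sketch. The abstract does not say which LSH family or which verification heuristics the authors use, so everything specific here is an assumption for illustration only: MinHash with banding over character 3-grams of titles, a Jaccard threshold standing in for the paper's heuristics, and hypothetical names such as `candidate_pairs` and `verify`.

```python
import hashlib
from collections import defaultdict
from itertools import product


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def _hash(seed, gram):
    # Stable 64-bit hash; the seed selects one of several hash functions.
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(grams, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seeded hash."""
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_hashes))


def candidate_pairs(wos, scopus, num_hashes=32, bands=8):
    """Step 1 (assumed variant): suggest (wos_id, scopus_id) pairs sharing an LSH bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(lambda: (set(), set()))  # key -> (WoS ids, Scopus ids)
    for side, records in enumerate((wos, scopus)):
        for rec_id, title in records.items():
            grams = char_ngrams(title)
            if not grams:  # title shorter than n characters: nothing to hash
                continue
            sig = minhash_signature(grams, num_hashes)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key][side].add(rec_id)
    pairs = set()
    for wos_ids, scopus_ids in buckets.values():
        pairs.update(product(wos_ids, scopus_ids))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def verify(pairs, wos, scopus, threshold=0.9):
    """Step 2 (placeholder for the paper's heuristics): keep only near-identical titles."""
    return {
        (w, s)
        for w, s in pairs
        if jaccard(char_ngrams(wos[w]), char_ngrams(scopus[s])) >= threshold
    }


# Toy records; real input would be millions of titles with further metadata.
wos = {"W1": "Use of locality sensitive hashing", "W2": "A different paper on economics"}
scopus = {"S1": "use of  Locality Sensitive hashing", "S2": "Quantum chromodynamics on the lattice"}
matches = verify(candidate_pairs(wos, scopus), wos, scopus)
```

The banding step only compares records that collide in at least one bucket, which is what avoids the full cross product of the two databases. A strict second step is then needed because bucket collisions are approximate and the stated goal tolerates no false positives.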

Keywords: Locality sensitive hashing (LSH); Character n-grams; Information retrieval from bibliographic databases; Bibliographic database overlap; Text matching
Date: 2018
Citations: 10 (in EconPapers)

Downloads: (external link)
http://link.springer.com/10.1007/s11192-017-2569-6 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:scient:v:116:y:2018:i:2:d:10.1007_s11192-017-2569-6

Ordering information: This journal article can be ordered from
http://www.springer.com/economics/journal/11192

DOI: 10.1007/s11192-017-2569-6

Scientometrics is currently edited by Wolfgang Glänzel

More articles in Scientometrics from Springer, Akadémiai Kiadó
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:scient:v:116:y:2018:i:2:d:10.1007_s11192-017-2569-6