The SearchEngine: A holistic approach to matching
Thorsten Doherr
No 23-001, ZEW Discussion Papers from ZEW - Leibniz Centre for European Economic Research
Abstract:
The SearchEngine is an open-source project providing an integrated framework for diverse matching activities, especially the linkage of large-scale firm data by fuzzy criteria such as company names and addresses. At its core, it uses an efficient candidate retrieval mechanism that implements a word- or token-driven heuristic. Every record in one table becomes a search term used to retrieve similar candidate records from the base table according to a search strategy, which replaces the blocking strategies of conventional matching efforts. Because similarity is inherently established by the candidate selection, the only remaining task is to filter out false positives, using the metadata export file derived from the matching heuristic to implement a machine learning approach. This paper discusses the general foundation of the heuristic and the algorithm, while two detailed walkthroughs of company linkages provide practical examples.
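As a rough illustration of what token-driven candidate retrieval can look like, the following Python sketch builds an inverted index over a base table and scores each base record by the share of the query's token weight it covers. It is not taken from the SearchEngine itself: the function names (`tokenize`, `build_index`, `retrieve`), the logarithmic rarity weighting, the threshold, and the sample firm names are all assumptions made for this example.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(base):
    """Map every token to the set of base-record ids that contain it."""
    index = defaultdict(set)
    for rec_id, text in base.items():
        for tok in set(tokenize(text)):
            index[tok].add(rec_id)
    return index


def retrieve(query, index, n_records, threshold=0.5):
    """Return (id, share) pairs for base records that cover at least
    `threshold` of the query's total token weight, best match first."""
    weights = {}
    for tok in set(tokenize(query)):
        # Assumed weighting: rare tokens count more; tokens unseen in the
        # base table are treated as if they occurred once.
        freq = len(index[tok]) if tok in index else 1
        weights[tok] = math.log(1 + n_records / freq)
    total = sum(weights.values())
    if total == 0:
        return []
    scores = defaultdict(float)
    for tok, w in weights.items():
        for rec_id in index.get(tok, ()):
            scores[rec_id] += w
    hits = [(rid, s / total) for rid, s in scores.items() if s / total >= threshold]
    return sorted(hits, key=lambda h: -h[1])


# Hypothetical base table of firm names.
base = {1: "Acme Widgets GmbH", 2: "Acme Trading GmbH", 3: "Zenith Optics AG"}
index = build_index(base)
matches = retrieve("ACME Widgets GmbH & Co", index, len(base))
```

In this sketch no blocking key is needed: any base record sharing enough weighted tokens with the search term becomes a candidate, and the per-candidate share is the kind of metadata a downstream classifier could use to filter false positives.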
Keywords: data linkage; firm matching; entity resolution; machine learning
JEL-codes: C81 C88
Date: 2023
New Economics Papers: this item is included in nep-cmp
Downloads: (external link)
https://www.econstor.eu/bitstream/10419/268428/1/1832674266.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:zbw:zewdip:23001
Bibliographic data for series maintained by ZBW - Leibniz Information Centre for Economics.