Fuzzy firm name matching: Merging Amadeus firm data to PATSTAT
Leon Bremer
Leon Bremer: Vrije Universiteit Amsterdam
No 23-055/VIII, Tinbergen Institute Discussion Papers from Tinbergen Institute
Abstract:
When merging firms across large databases in the absence of common identifiers, text algorithms can help. I propose a high-performance fuzzy firm name matching algorithm that uses existing computational methods and works even under hardware restrictions. The algorithm consists of four steps: (1) cleaning, (2) similarity scoring, (3) a decision rule based on supervised machine learning, and (4) group identification using community detection. The algorithm is applied to merging firms in the Amadeus Financials and Subsidiaries databases, which contain firm-level business and ownership information, with applicants in PATSTAT, a worldwide patent database. In this application, the algorithm vastly outperforms an exact string match, increasing the number of matched firms in the Amadeus Financials (Subsidiaries) database by 116% (160%). Cleaning accounts for a 53% (74%) improvement, and similarity matching for a further 41% (50%) improvement on top of that. Overall, 18.1% of all patent applications since 1950 are matched to firms in the Amadeus databases, compared to 2.6% for an exact name match.
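The four steps named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the legal-form list, the character-based similarity from `difflib`, the fixed 0.9 threshold (standing in for the supervised machine-learning decision rule), and the use of connected components (standing in for community detection) are all assumptions chosen so that the example runs with the standard library alone. The firm names are illustrative.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical legal-form tokens; the paper's actual cleaning rules may differ.
LEGAL_FORMS = {"ltd", "limited", "inc", "corp", "gmbh", "bv", "nv", "sa", "ag", "co"}


def clean(name):
    """Step 1 (cleaning): lowercase, drop punctuation, remove legal-form tokens."""
    s = re.sub(r"[.']", "", name.lower())   # "N.V." -> "nv"
    s = re.sub(r"[^a-z0-9 ]", " ", s)       # other symbols become spaces
    return " ".join(t for t in s.split() if t not in LEGAL_FORMS)


def similarity(a, b):
    """Step 2 (similarity scoring): character-based ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(score, threshold=0.9):
    """Step 3 (decision rule): a fixed threshold, an assumed stand-in
    for the supervised classifier described in the abstract."""
    return score >= threshold


def group(names, threshold=0.9):
    """Step 4 (group identification): connected components via union-find,
    an assumed stand-in for community detection."""
    cleaned = [clean(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # All-pairs comparison is quadratic; fine for a toy example only.
    for i, j in combinations(range(len(names)), 2):
        if is_match(similarity(cleaned[i], cleaned[j]), threshold):
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


names = [
    "Philips Electronics N.V.",
    "PHILIPS ELECTRONICS NV",
    "Phillips Electronics",
    "Siemens AG",
    "SIEMENS A.G.",
    "Shell",
]
for members in group(names):
    print(members)
```

An exact string match would treat all six names as distinct firms. In this sketch, cleaning alone links the two Siemens spellings and the first two Philips spellings, and similarity scoring additionally links the misspelled "Phillips" variant, mirroring the separate contributions of cleaning and similarity matching reported in the abstract.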
Keywords: Fuzzy name matching; supervised machine learning; name disambiguation; patents
JEL-codes: C81 C88 O34
Date: 2023-10-12
New Economics Papers: this item is included in nep-bec, nep-big, nep-cmp, nep-ipr, nep-sbm and nep-tid
Downloads:
https://papers.tinbergen.nl/23055.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:tin:wpaper:20230055