Creating Powerful Indicators for Innovation Studies with Approximate Matching Algorithms. A test based on PATSTAT and Amadeus databases
Grid Thoma () and
Salvatore Torrisi ()
No 211, KITeS Working Papers from KITeS, Centre for Knowledge, Internationalization and Technology Studies, Universita' Bocconi, Milano, Italy
The lack of firm-level data on innovative activities has always constrained the development of empirical studies on innovation. More recently, the availability of large datasets on indicators, such as R&D expenditures and patents, has relaxed these constrains and spurred the growth of a new wave of research. However, measuring innovation still remains a difficult task for reasons linked to the quality of available indicators and the difficulty of integrating innovation indicators to other firm-level data. As regards quality, data on R&D expenditures represent a measure of input but do not tell much about the ‘success’ of innovative activities. Moreover, especially in the case of European firms, data on R&D expenditures are often missing because reporting these expenditures is not required by accounting and fiscal regulations in some countries. An increasing number of studies have used patents counts as a measure of inventive output. However, crude patent counts are a biased indicator of inventive output because they do not account for differences in the value of patented inventions. This is the reason why innovation scholars have introduced various patent-related indicators as a measure of the ‘quality’ of the inventive output. Integrating these measures of inventive activity with other firm-level information, such as accounting and financial data, is another challenging task. A major problem in this field is represented by the difficulty of harmonizing information from different data sources. This is a relevant issue since inaccuracy in data merging and integration leads to measurement errors and biased results. An important source of measurement error arises from inaccuracies in matching data on innovators across different datasets. This study reports on a test of company names standardization and matching. Our test is based on two data sources: the PATSTAT patent database and the Amadeus accounting and financial dataset. Earlier studies have mostly relied on manual, ad-hoc methods. More recently scholars have started experimenting with automatic matching techniques. This paper contributes to this body of research by comparing two different approaches – the character-tocharacter match of standardized company names (perfect matching) and the approximate matching based on string similarity functions. Our results show that approximate matching yields substantial gains over perfect matching, in terms of frequency of positive matches, with a limited loss of precision – i.e., low rates of false matches and false negatives.
Keywords: innovation statistics; patents; matching company names; software. (search for similar items in EconPapers)
JEL-codes: C81 C88 O31 O34 (search for similar items in EconPapers)
Pages: pages 24
Date: 2007-12, Revised 2007-12
New Economics Papers: this item is included in nep-acc, nep-ino, nep-ipr and nep-pr~
References: View references in EconPapers View complete reference list from CitEc
Citations: View citations in EconPapers (9) Track citations by RSS feed
Downloads: (external link)
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
Persistent link: https://EconPapers.repec.org/RePEc:cri:cespri:wp211
Ordering information: This working paper can be ordered from
E G E A - via R. Sarfatti, 25 - 20136 Milano -Italy
Access Statistics for this paper
More papers in KITeS Working Papers from KITeS, Centre for Knowledge, Internationalization and Technology Studies, Universita' Bocconi, Milano, Italy via Sarfatti, 25 - 20136 Milano - Italy.
Bibliographic data for series maintained by Valerio Sterzi ( this e-mail address is bad, please contact ).