An Algorithm for Matching Heterogeneous Financial Databases: A Case Study for COMPUSTAT/CRSP and I/B/E/S Databases
Irene Rodriguez-Lujan and Ramon Huerta
Applied Economics and Finance, 2016, vol. 3, issue 1, 161-172
Abstract:
Rigorous and proper linking of financial databases is a necessary step to test trading strategies incorporating multimodal sources of information. This paper proposes a machine learning solution to match companies in heterogeneous financial databases. Our method, named Financial Attribute Selection Distance (FASD), has two stages, each corresponding to one of the two interrelated tasks commonly involved in heterogeneous database matching problems: schema matching and entity matching. FASD's schema matching procedure is based on the Kullback-Leibler divergence of string and numeric attributes. FASD's entity matching solution relies on learning a company distance flexible enough to deal with the numeric and string attribute links found by the schema matching algorithm, and it incorporates different string matching approaches such as edit-based and token-based metrics. The parameters of the distance are optimized using the F-score as the cost function. FASD is able to match the joint Compustat/CRSP and Institutional Brokers' Estimate System (I/B/E/S) databases with an F-score over 0.94 using only a hundred manually labeled company links.
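The abstract names four building blocks: a Kullback-Leibler divergence between attributes, an edit-based string metric, a token-based string metric, and the F-score. The Python sketch below illustrates textbook versions of each and is not the FASD implementation. The function names, the histogram binning, the smoothing constant `eps`, and the choice of Levenshtein and Jaccard as the edit-based and token-based metrics are all assumptions made for this example; the paper's learned company distance and its optimization are not reproduced here.

```python
import math


def kl_divergence(p_samples, q_samples, bins, eps=1e-9):
    """Smoothed KL divergence D(P || Q) between histograms over shared bins.

    `bins` is a sorted list of bin edges. `eps` is an illustrative smoothing
    constant (an assumption) that keeps empty bins from producing log(0).
    """
    def histogram(samples):
        counts = [0] * (len(bins) - 1)
        for x in samples:
            for i in range(len(bins) - 1):
                is_last = i == len(bins) - 2
                if bins[i] <= x < bins[i + 1] or (is_last and x == bins[-1]):
                    counts[i] += 1
                    break
        total = sum(counts) + eps * len(counts)
        return [(c + eps) / total for c in counts]

    p, q = histogram(p_samples), histogram(q_samples)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def levenshtein(a, b):
    """Edit-based metric: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_jaccard(a, b):
    """Token-based metric: Jaccard similarity of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def f_score(predicted_links, true_links):
    """F1 score of predicted company links against manually labeled links."""
    predicted, truth = set(predicted_links), set(true_links)
    tp = len(predicted & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

In a pipeline of the kind the abstract outlines, a low divergence between two attributes from different databases would suggest that they describe the same quantity, the string metrics would feed a company-level distance, and the F-score on a small labeled set would guide the choice of that distance's parameters. How FASD actually combines these pieces is specified in the paper, not here.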
Keywords: Compustat/CRSP; I/B/E/S; financial data; heterogeneous databases; company matching; schema matching; attribute matching; Kullback-Leibler divergence
Date: 2016
Downloads: (external link)
http://redfame.com/journal/index.php/aef/article/view/1164/1331 (application/pdf)
http://redfame.com/journal/index.php/aef/article/view/1164 (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:rfa:aefjnl:v:3:y:2016:i:1:p:161-172
Publisher: Redfame publishing