An Algorithm for Matching Heterogeneous Financial Databases: A Case Study for COMPUSTAT/CRSP and I/B/E/S Databases

Irene Rodriguez-Lujan and Ramon Huerta

Applied Economics and Finance, 2016, vol. 3, issue 1, 161-172

Abstract: Rigorous linking of financial databases is a necessary step in testing trading strategies that incorporate multimodal sources of information. This paper proposes a machine learning solution for matching companies across heterogeneous financial databases. Our method, named Financial Attribute Selection Distance (FASD), has two stages, each corresponding to one of the two interrelated tasks commonly involved in heterogeneous database matching problems: schema matching and entity matching. FASD's schema matching procedure is based on the Kullback-Leibler divergence of string and numeric attributes. FASD's entity matching solution learns a company distance flexible enough to handle the numeric and string attribute links found by the schema matching algorithm, and it incorporates different string matching approaches, such as edit-based and token-based metrics. The parameters of the distance are optimized using the F-score as the cost function. FASD matches the joint Compustat/CRSP and Institutional Brokers' Estimate System (I/B/E/S) databases with an F-score above 0.94 using only a hundred manually labeled company links.
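
Illustration: the two quantities named in the abstract, a Kullback-Leibler divergence used to link attributes and an F-score used to evaluate company links, can be sketched roughly as follows. This is a minimal sketch and not the authors' FASD implementation. It assumes that string attributes are compared through Laplace-smoothed character-frequency distributions, which the abstract does not specify. The column names (`name`, `id`, `cname`, `cusip`), the toy values, and all function names are hypothetical.

```python
import math
import string
from collections import Counter

# Assumed alphabet for the toy example: uppercase letters, digits, and space.
ALPHABET = string.ascii_uppercase + string.digits + " "


def char_distribution(values, alphabet=ALPHABET):
    """Laplace-smoothed character frequencies of a column of strings."""
    counts = Counter(ch for v in values for ch in v.upper() if ch in alphabet)
    total = sum(counts.values()) + len(alphabet)
    return {ch: (counts[ch] + 1) / total for ch in alphabet}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def match_attributes(db_a, db_b):
    """Link each column of db_a to the db_b column with the lowest divergence."""
    dist_b = {name: char_distribution(vals) for name, vals in db_b.items()}
    links = {}
    for name_a, vals_a in db_a.items():
        p = char_distribution(vals_a)
        links[name_a] = min(dist_b, key=lambda n: kl_divergence(p, dist_b[n]))
    return links


def f_score(predicted, truth):
    """Balanced F-score (F1) of a set of predicted links against true links."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy columns standing in for the two databases.
db_a = {"name": ["APPLE INC", "MICROSOFT CORP"], "id": ["037833100", "594918104"]}
db_b = {"cname": ["APPLE", "MICROSOFT"], "cusip": ["03783310", "59491810"]}
links = match_attributes(db_a, db_b)

# Hypothetical predicted and manually labeled company links.
predicted = {("A", "1"), ("B", "2"), ("C", "9")}
truth = {("A", "1"), ("B", "2"), ("D", "4")}
score = f_score(predicted, truth)
```

On this toy input the name-like columns link to each other and the identifier-like columns link to each other, because letter-heavy and digit-heavy columns have very different character distributions. The smoothing keeps every probability positive so the divergence stays finite.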

Keywords: Compustat/CRSP; I/B/E/S; financial data; heterogeneous databases; company matching; schema matching; attribute matching; Kullback-Leibler divergence
Date: 2016

Downloads: (external link)
http://redfame.com/journal/index.php/aef/article/view/1164/1331 (application/pdf)
http://redfame.com/journal/index.php/aef/article/view/1164 (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:rfa:aefjnl:v:3:y:2016:i:1:p:161-172

Series: Applied Economics and Finance, from Redfame Publishing

 
Page updated 2025-03-19
Handle: RePEc:rfa:aefjnl:v:3:y:2016:i:1:p:161-172