A Novel Suite of Methods for Mixture Based Record Linkage
Diego Zardetto (zardetto@istat.it) and
Monica Scannapieco (scannapi@istat.it)
Additional contact information
Monica Scannapieco: Italian National Institute of Statistics
Rivista di statistica ufficiale, 2010, vol. 12, issue 2-3, 31-58
Abstract:
Record Linkage (RL) aims at identifying pairs of records that come from different sources and represent the same real-world object. Although several methods have been proposed to address RL problems, none of them seems to be both fully automated and highly effective. In this paper we present a novel suite of methods that possesses both properties. We adopt a mixture-model-based approach, which structures an RL process into two consecutive tasks. First, mixture parameters are estimated by fitting the model to observed distance measures between pairs. Then, a probabilistic clustering of the pairs into Matches and Unmatches is obtained by exploiting the fitted model. In particular, we use a mixture model with component densities belonging to the Beta parametric family, and we fit it by means of an original perturbation-like technique. Moreover, we solve the clustering problem according to both Maximum Likelihood and Minimum Cost objectives. To accomplish this task, optimal decision rules fulfilling one-to-one matching constraints are searched for by a purposefully designed evolutionary algorithm. We present several experiments on real data that validate our methods and show their excellent effectiveness.
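Illustration (not part of the published abstract): the second task described above, clustering pairs with an already fitted two-component Beta mixture, can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The mixture weight and Beta shape parameters are hypothetical values chosen for illustration, distances are assumed to lie strictly between 0 and 1, and the decision rule is a plain per-pair maximum-posterior assignment. The paper's perturbation-like fitting technique, its Minimum Cost objective, and its evolutionary search under one-to-one matching constraints are not reproduced here.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def match_posterior(d, p_match, match_params, unmatch_params):
    """Posterior probability that a pair at distance d belongs to the Match component."""
    m = p_match * beta_pdf(d, *match_params)
    u = (1 - p_match) * beta_pdf(d, *unmatch_params)
    return m / (m + u)


# Hypothetical "fitted" parameters, for illustration only.
P_MATCH = 0.05               # assumed share of Matches among all pairs
MATCH_PARAMS = (1.5, 12.0)   # assumed Beta shape: mass near distance 0
UNMATCH_PARAMS = (8.0, 3.0)  # assumed Beta shape: mass near distance 1

distances = [0.05, 0.50, 0.80]
posteriors = [
    match_posterior(d, P_MATCH, MATCH_PARAMS, UNMATCH_PARAMS) for d in distances
]
# Assign each pair to the component with the larger posterior probability.
labels = ["Match" if q > 0.5 else "Unmatch" for q in posteriors]

for d, q, label in zip(distances, posteriors, labels):
    print(f"distance={d:.2f}  P(Match | d)={q:.4f}  ->  {label}")
```

Because each pair is classified independently here, one record could end up matched to several others; enforcing one-to-one matching requires a joint search over pairs, which is the role the abstract assigns to the evolutionary algorithm.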
Keywords: Record linkage; Mixture parameters
JEL-codes: C81 C89
Date: 2010
Downloads:
http://www.istat.it/it/files/2011/09/2-3_2010_2.pdf (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:isa:journl:v:12:y:2010:i:2-3:p:31-58
Series: Rivista di statistica ufficiale, from ISTAT - Italian National Institute of Statistics (Rome, Italy).
Bibliographic data for series maintained by Stefania Rossetti (strosset@istat.it).