A machine learning approach to create blocking criteria for record linkage

Phan Giang

Health Care Management Science, 2015, vol. 18, issue 1, 93-105

Abstract: Record linkage, a part of data cleaning, is recognized as one of the most expensive steps in data warehousing. Most record linkage (RL) systems use blocking filters to reduce the number of pairs to be matched. A blocking filter consists of a number of blocking criteria. Until recently, blocking criteria were selected manually by domain experts. This paper proposes a new method to automatically learn efficient blocking criteria for record linkage. Our method addresses the lack of sufficient labeled data for training. Unlike previous work, we do not consider a blocking filter in isolation but in the context of an accompanying matcher that is applied after the blocking filter. We show that, given such a matcher, the labels (assigned to record pairs) that are relevant for learning are the labels assigned by the matcher (link/nonlink), not the labels assigned objectively (match/unmatch). This conclusion allows us to generate an unlimited amount of labeled data for training. We formulate the problem of learning a blocking filter as a Disjunctive Normal Form (DNF) learning problem and use Probably Approximately Correct (PAC) learning theory to guide the development of an algorithm that searches for blocking filters. We test the algorithm on a real patient master file of 2.18 million records. The experimental results show that, compared with filters obtained by educated guess, the optimal learned filters have comparable recall but reduce runtime by an order of magnitude. Copyright Springer Science+Business Media New York 2015
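
The idea of a blocking filter as a DNF formula trained on matcher-assigned labels can be illustrated with a small sketch. This is not the paper's algorithm: it is a plain greedy cover, with no PAC-derived bounds, and the records, field names, the `matcher` rule, and the `max_nonlinks` threshold are all hypothetical choices made for illustration. Each term is a conjunction of field-equality tests, and the filter is the disjunction of the selected terms.

```python
from itertools import combinations

# Hypothetical field names and toy records, for illustration only.
FIELDS = ("last", "first", "dob", "zip")
RECORDS = [
    {"id": 0, "last": "smith", "first": "john", "dob": "1980-01-02", "zip": "22030"},
    {"id": 1, "last": "smith", "first": "jon", "dob": "1980-01-02", "zip": "22030"},
    {"id": 2, "last": "smyth", "first": "john", "dob": "1980-01-02", "zip": "22030"},
    {"id": 3, "last": "doe", "first": "jane", "dob": "1975-05-05", "zip": "22030"},
    {"id": 4, "last": "doe", "first": "jane", "dob": "1975-05-05", "zip": "20001"},
    {"id": 5, "last": "brown", "first": "john", "dob": "1990-09-09", "zip": "22030"},
]


def matcher(a, b):
    """Stand-in matcher (an assumption): link when at least 3 of 4 fields agree."""
    return sum(a[f] == b[f] for f in FIELDS) >= 3


def covered_pairs(term, pairs):
    """Pairs that satisfy a conjunction: agreement on every field in the term."""
    return {(a["id"], b["id"]) for a, b in pairs if all(a[f] == b[f] for f in term)}


def learn_blocking_filter(records, max_nonlinks=0):
    """Greedily pick conjunctions whose disjunction covers the matcher's links."""
    # Enumerating all pairs is only feasible on a small training sample.
    pairs = list(combinations(records, 2))
    # Labels come from the matcher (link/nonlink), not from ground truth.
    links = {(a["id"], b["id"]) for a, b in pairs if matcher(a, b)}
    # Candidate terms: every conjunction of one or two field equalities.
    candidates = [t for k in (1, 2) for t in combinations(FIELDS, k)]
    coverage = {t: covered_pairs(t, pairs) for t in candidates}
    # Discard terms that let through too many nonlink pairs.
    admissible = [t for t in candidates if len(coverage[t] - links) <= max_nonlinks]
    dnf, uncovered = [], set(links)
    while uncovered and admissible:
        best = max(admissible, key=lambda t: len(coverage[t] & uncovered))
        if not coverage[best] & uncovered:
            break  # no admissible term covers any remaining link
        dnf.append(best)
        uncovered -= coverage[best]
    passed = set().union(*(coverage[t] for t in dnf)) if dnf else set()
    recall = len(passed & links) / len(links) if links else 1.0
    return dnf, recall, len(passed), len(pairs)


if __name__ == "__main__":
    dnf, recall, n_passed, n_total = learn_blocking_filter(RECORDS)
    print("filter:", " OR ".join("(" + " AND ".join(t) + ")" for t in dnf))
    print(f"recall vs. matcher: {recall:.2f}, pairs passed: {n_passed}/{n_total}")
```

Because the labels come from running the matcher itself, more training pairs can be produced simply by sampling more records, which is the point the abstract makes about unlimited labeled data.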

Keywords: Record linkage; Machine learning; Blocking criteria; Disjunctive Normal Form (DNF) learning
Date: 2015

Downloads: (external link)
http://hdl.handle.net/10.1007/s10729-014-9276-0 (text/html)
Access to full text is restricted to subscribers.

Persistent link: https://EconPapers.repec.org/RePEc:kap:hcarem:v:18:y:2015:i:1:p:93-105

Ordering information: This journal article can be ordered from
http://www.springer.com/journal/10729

DOI: 10.1007/s10729-014-9276-0

Health Care Management Science is currently edited by Yasar Ozcan

More articles in Health Care Management Science from Springer
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-19
Handle: RePEc:kap:hcarem:v:18:y:2015:i:1:p:93-105