Ranking the blocking keys for data de-duplication in information systems

Asif Sohail and Syed Waqar Jaffry

International Journal of Business Information Systems, 2025, vol. 49, issue 2, 180-198

Abstract: Data de-duplication is an essential activity in data integration and data cleansing. It identifies and removes disguised duplicates in a dataset. Blocking is an established technique for reducing the inherent quadratic complexity of de-duplication. Blocking gathers potentially matching records into the same block on the basis of a blocking key. The results of blocking fluctuate considerably when different blocking keys are employed. Hence, it is important to select an appropriate blocking key to maximise the efficacy and efficiency of blocking. The proposed technique ranks the attributes of a dataset with respect to their usability as a blocking key. We introduce a novel correlation measure, called R-score, for computing the correlation between the gold rankings and the computed rankings of the blocking keys. The proposed technique is evaluated on benchmark datasets, and the experimental results confirm that it outperforms the existing techniques.
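
The sensitivity of blocking to the choice of key can be illustrated with a minimal sketch. This is not the authors' ranking technique or their R-score; it only shows standard key-based blocking and the usual definitions of reduction ratio and recall (both listed in the keywords). The toy records, the attribute names `surname` and `city`, and the helper names `block_pairs` and `evaluate` are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Group record ids by their blocking-key value and return candidate pairs."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[rec[key]].append(rid)
    pairs = set()
    for ids in blocks.values():
        # Only records that share a block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def evaluate(records, key, true_matches):
    """Return (reduction ratio, recall) for one candidate blocking key."""
    n = len(records)
    total_pairs = n * (n - 1) // 2          # quadratic number of comparisons
    candidates = block_pairs(records, key)
    reduction_ratio = 1 - len(candidates) / total_pairs
    recall = len(candidates & true_matches) / len(true_matches)
    return reduction_ratio, recall


# Hypothetical toy data: records 1, 2 and 3 describe the same entity.
records = {
    1: {"surname": "smith", "city": "lahore"},
    2: {"surname": "smith", "city": "lahore"},
    3: {"surname": "smyth", "city": "lahore"},
    4: {"surname": "khan", "city": "karachi"},
}
true_matches = {(1, 2), (1, 3), (2, 3)}

for key in ("surname", "city"):
    rr, rec = evaluate(records, key, true_matches)
    print(f"{key}: reduction ratio={rr:.2f}, recall={rec:.2f}")
```

On this toy data, `surname` prunes more comparisons but misses the misspelled duplicate, while `city` finds every true match at the cost of more comparisons. A method that ranks blocking keys has to weigh this kind of trade-off.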

Keywords: data integration; data cleansing; blocking; candidate record pairs; rank correlation; promising attributes; reduction ratio; recall
Date: 2025

Downloads: (external link)
http://www.inderscience.com/link.php?id=146600 (text/html)
Access to full text is restricted to subscribers.

Persistent link: https://EconPapers.repec.org/RePEc:ids:ijbisy:v:49:y:2025:i:2:p:180-198

More articles in International Journal of Business Information Systems from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker.

 
Page updated 2025-06-10
Handle: RePEc:ids:ijbisy:v:49:y:2025:i:2:p:180-198