Evaluating entity-description conflict on duplicated data

Lingli Li, Jianzhong Li and Hong Gao
Additional contact information
Lingli Li: Harbin Institute of Technology
Jianzhong Li: Harbin Institute of Technology
Hong Gao: Harbin Institute of Technology

Journal of Combinatorial Optimization, 2016, vol. 31, issue 2, No 31, 918-941

Abstract: Duplicated records, which describe the same real-world entity, are frequently generated by data integration. Ideally, duplicated records should have identical values on the same attributes. In practice, they may carry conflicting values because of ambiguity and data errors. The more conflicts there are among duplicated records in a data set, the poorer the quality of that data set. To address this problem, we explore a new data quality measure, entity-description conflict, which evaluates the conflict among duplicated records. Because current entity resolution algorithms can hardly identify duplicated records correctly and completely, computing the entity-description conflict is challenging. This paper therefore studies how to compute the range of the entity-description conflict when the entity resolution result is not completely correct. (1) A mathematical model of the entity-description conflict is introduced. (2) Four primary operators for computing the range of the entity-description conflict are identified and proved to be NP-hard, which in turn proves that computing the range of the entity-description conflict is NP-hard. (3) Four approximation algorithms for the four primary operators are provided, and a framework based on the four primary operators is proposed for computing the range of the entity-description conflict. (4) Experiments on real-life and synthetic data verify the effectiveness and efficiency of the proposed algorithms.
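
To make the idea of a conflict range concrete, the following is a minimal sketch under stated assumptions; it is not the model, the operators, or the approximation algorithms of the paper, which the abstract does not spell out. It assumes a hypothetical conflict score (the fraction of entity/attribute pairs whose duplicated records disagree) and a small, explicitly listed set of candidate entity resolution results. The names `conflict`, `conflict_range`, and the sample records are illustrative only.

```python
from collections import defaultdict


def conflict(records, clustering, attributes):
    """Hypothetical conflict score, NOT the paper's definition.

    Returns the fraction of (entity, attribute) pairs for which the
    records assigned to that entity hold more than one distinct value.
    """
    # Group records by the entity they were resolved to.
    groups = defaultdict(list)
    for record_id, entity_id in clustering.items():
        groups[entity_id].append(records[record_id])

    total = 0
    conflicting = 0
    for members in groups.values():
        for attr in attributes:
            # Missing values are ignored rather than counted as conflicts.
            values = {m.get(attr) for m in members if m.get(attr) is not None}
            total += 1
            if len(values) > 1:
                conflicting += 1
    return conflicting / total if total else 0.0


def conflict_range(records, candidate_clusterings, attributes):
    """Brute-force range over an explicit list of candidate clusterings.

    This enumeration is only an illustration; the abstract states that
    the real problem is NP-hard and is handled by approximation.
    """
    scores = [conflict(records, c, attributes) for c in candidate_clusterings]
    return min(scores), max(scores)


# Illustrative data (assumed, not from the paper).
records = {
    "r1": {"name": "J. Smith", "city": "Boston"},
    "r2": {"name": "John Smith", "city": "Boston"},
    "r3": {"name": "John Smith", "city": "Austin"},
}
attributes = ["name", "city"]

# Two plausible entity resolution results: r3 is either a separate
# entity or a duplicate of r1 and r2.
candidates = [
    {"r1": "e1", "r2": "e1", "r3": "e2"},
    {"r1": "e1", "r2": "e1", "r3": "e1"},
]

low, high = conflict_range(records, candidates, attributes)
```

Because the score depends on which records are grouped together, uncertainty in entity resolution turns a single conflict value into an interval, which is the quantity the abstract says the paper sets out to compute.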

Keywords: Entity-description conflict; Evaluation; Data quality; Data integration
Date: 2016

Downloads: (external link)
http://link.springer.com/10.1007/s10878-014-9801-6 Abstract (text/html)
Access to the full text of the articles in this series is restricted.

Persistent link: https://EconPapers.repec.org/RePEc:spr:jcomop:v:31:y:2016:i:2:d:10.1007_s10878-014-9801-6

Ordering information: This journal article can be ordered from
https://www.springer.com/journal/10878

DOI: 10.1007/s10878-014-9801-6

Journal of Combinatorial Optimization is currently edited by Thai, My T.

Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Page updated 2025-03-20
Handle: RePEc:spr:jcomop:v:31:y:2016:i:2:d:10.1007_s10878-014-9801-6