EconPapers

Performance and Scalability of Data Cleaning and Preprocessing Tools: A Benchmark on Large Real-World Datasets

Pedro Martins, Filipe Cardoso, Paulo Váz, José Silva and Maryam Abbasi
Additional contact information
Pedro Martins: Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal
Filipe Cardoso: Polytechnic Institute of Santarém, Escola Superior de Gestão e Tecnologia de Santarém, 2001-904 Santarém, Portugal
Paulo Váz: Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal
José Silva: Research Center in Digital Services, Polytechnic of Viseu, 3504-510 Viseu, Portugal
Maryam Abbasi: Applied Research Institute, Polytechnic of Coimbra, 3045-093 Coimbra, Portugal

Data, 2025, vol. 10, issue 5, 1-22

Abstract: Data cleaning remains one of the most time-consuming and critical steps in modern data science, directly influencing the reliability and accuracy of downstream analytics. In this paper, we present a comprehensive evaluation of five widely used data cleaning tools—OpenRefine, Dedupe, Great Expectations, TidyData (PyJanitor), and a baseline Pandas pipeline—applied to large-scale, messy datasets spanning three domains (healthcare, finance, and industrial telemetry). We benchmark each tool on dataset sizes ranging from 1 million to 100 million records, measuring execution time, memory usage, error detection accuracy, and scalability under increasing data volumes. Additionally, we assess qualitative aspects such as usability and ease of integration, reflecting real-world adoption concerns. We incorporate recent findings on parallelized data cleaning and highlight how domain-specific anomalies (e.g., negative amounts in finance, sensor corruption in industrial telemetry) can significantly impact tool choice. Our findings reveal that no single solution excels across all metrics; while Dedupe provides robust duplicate detection and Great Expectations offers in-depth rule-based validation, tools like TidyData and baseline Pandas pipelines demonstrate strong scalability and flexibility under chunk-based ingestion. The choice of tool ultimately depends on domain-specific requirements (e.g., approximate matching in finance and strict auditing in healthcare) and the magnitude of available computational resources. By highlighting each framework’s strengths and limitations, this study offers data practitioners clear, evidence-driven guidance for selecting and combining tools to tackle large-scale data cleaning challenges.
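
The baseline Pandas pipeline and the chunk-based ingestion mentioned in the abstract can be illustrated with a minimal sketch. This is not code from the paper; the file name, column names ("amount", "record_id"), chunk size, and cleaning rules below are illustrative assumptions chosen to mirror the kinds of anomalies the abstract describes (duplicates, negative amounts in finance, missing identifiers).

```python
# Minimal sketch of a chunk-based Pandas cleaning pipeline (illustrative only).
import pandas as pd

CHUNK_SIZE = 1_000_000  # process 1M records at a time to bound memory use

cleaned_chunks = []
for chunk in pd.read_csv("transactions.csv", chunksize=CHUNK_SIZE):
    # Drop exact duplicate records within the chunk.
    chunk = chunk.drop_duplicates()

    # Remove a domain-specific anomaly, e.g. negative amounts in finance data.
    chunk = chunk[chunk["amount"] >= 0]

    # Drop rows missing a key identifier (hypothetical "record_id" column).
    chunk = chunk.dropna(subset=["record_id"])

    cleaned_chunks.append(chunk)

# Concatenate cleaned chunks; for very large inputs, writing each chunk to disk
# instead of holding all of them in memory keeps the footprint bounded.
cleaned = pd.concat(cleaned_chunks, ignore_index=True)
```

Processing the data in fixed-size chunks is what allows such a pipeline to scale to the 100-million-record datasets benchmarked in the paper without exhausting memory.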

Keywords: data cleaning; large-scale benchmarking; duplicate detection; data validation; healthcare; finance
JEL-codes: C8 C80 C81 C82 C83
Date: 2025

Downloads: (external link)
https://www.mdpi.com/2306-5729/10/5/68/pdf (application/pdf)
https://www.mdpi.com/2306-5729/10/5/68/ (text/html)



Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:10:y:2025:i:5:p:68-:d:1649421


Data is currently edited by Ms. Cecilia Yang

More articles in Data from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-05-06
Handle: RePEc:gam:jdataj:v:10:y:2025:i:5:p:68-:d:1649421