Detection and Correction of Abnormal Data with Optimized Dirty Data: A New Data Cleaning Model
Kumar Rahul and
Rohitash Kumar Banyal
Additional contact information
Kumar Rahul: Department of Basic and Applied Science, NIFTEM, Sonipat 131028, India
Rohitash Kumar Banyal: Department of Computer Science and Engineering, Rajasthan Technical University, Kota, 324010, India
International Journal of Information Technology & Decision Making (IJITDM), 2021, vol. 20, issue 02, 809-841
Abstract:
Every business enterprise requires clean, noise-free data. Because a data warehouse continuously loads and refreshes large quantities of data from various sources, the amount of dirty data tends to grow. To avoid drawing wrong conclusions, data cleaning therefore becomes a vital step in data-related projects. This paper introduces a novel data cleaning technique for the effective removal of dirty data. The process involves two steps: (i) dirty data detection and (ii) dirty data cleaning. The detection step comprises data normalization, hashing, clustering, and identification of suspected data. In the clustering process, the optimal selection of centroids is the key task and is carried out using an optimization approach. Once dirty data detection is complete, the cleaning step begins; it comprises a leveling process, Huffman coding, and the cleaning of the suspected data, which is likewise performed using the optimization concept. To solve all of the optimization problems, a new hybrid algorithm, the Firefly Update Enabled Rider Optimization Algorithm (FU-ROA), is proposed by hybridizing the Rider Optimization Algorithm (ROA) and the Firefly (FF) algorithm. Finally, the performance of the implemented data cleaning method is compared with traditional methods such as Particle Swarm Optimization (PSO), FF, Grey Wolf Optimizer (GWO), and ROA in terms of their positive and negative measures. The results show that at iteration 12, the proposed FU-ROA model for test case 1 was 0.013%, 0.7%, 0.64%, and 0.29% better than the existing PSO, FF, GWO, and ROA models, respectively.
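As a rough illustration of the detection stage outlined in the abstract (normalization, hashing, clustering, and flagging of suspected data), the following Python sketch strings those steps together on a toy table. The function names, the distance threshold, and the plain k-means-style clustering are assumptions made here for brevity; the paper itself selects the cluster centroids with the proposed FU-ROA optimizer and pairs detection with a separate Huffman-coding-based cleaning stage, neither of which is reproduced below.

import hashlib
import random


def normalize(rows):
    # Min-max scale each numeric column into [0, 1].
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]


def record_hash(row):
    # Stable digest of a record, used here to spot exact duplicates.
    return hashlib.sha256(repr(row).encode()).hexdigest()


def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster(points, k, iters=20, seed=0):
    # Plain k-means stand-in; the paper instead tunes centroids with FU-ROA.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        centroids = [[sum(v) / len(g) for v in zip(*g)] if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids


def suspected(points, centroids, threshold=0.5):
    # Flag records that sit far from every centroid as "suspected" dirty data.
    return [p for p in points
            if min(dist(p, c) for c in centroids) > threshold]


if __name__ == "__main__":
    raw = [[1, 200], [2, 210], [3, 190], [50, 9000]]  # last record is dirty
    norm = normalize(raw)
    duplicates = len(norm) - len({record_hash(r) for r in norm})
    print("exact duplicates:", duplicates)
    print("suspected records:", suspected(norm, cluster(norm, k=1)))

Running the script flags the out-of-range record as suspected; in the paper's pipeline this is the point at which the cleaning stage (leveling, Huffman coding, and optimizer-guided correction) would take over.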
Keywords: Dirty data; data prediction; data cleaning; optimization; Huffman coding
Date: 2021
Citations: View citations in EconPapers (2)
Downloads: (external link)
http://www.worldscientific.com/doi/abs/10.1142/S0219622021500188
Access to full text is restricted to subscribers
Persistent link: https://EconPapers.repec.org/RePEc:wsi:ijitdm:v:20:y:2021:i:02:n:s0219622021500188
DOI: 10.1142/S0219622021500188
International Journal of Information Technology & Decision Making (IJITDM) is currently edited by Yong Shi
More articles in International Journal of Information Technology & Decision Making (IJITDM) from World Scientific Publishing Co. Pte. Ltd.
Bibliographic data for series maintained by Tai Tone Lim.