Data quality improvement in data warehouse: a framework
Rajiv Arora,
Payal Pahwa and
Daya Gupta
International Journal of Data Analysis Techniques and Strategies, 2017, vol. 9, issue 1, 17-33
Abstract:
Data cleansing is an essential process which, when carried out on datasets, eliminates inconsistency and duplication from the data. It also handles null or missing values in an organised manner, thereby enhancing the quality of the data. In this paper, we use the Kullback-Leibler divergence (KL-divergence) technique to eliminate duplication in the datasets. Inconsistency and null or missing values are also handled. This is done by maintaining data marts which are built on the basis of test data. Accordingly, a framework for efficient data cleansing is suggested in order to make the data suitable for decision making. A brief comparison of existing data cleansing approaches is also presented. This comparison is based on various parameters such as prediction error, bias, mean square error, variance, mean absolute error, root mean square error and Theil statistics. These parameters are used by the distance sum-based approach (DSA) to accomplish the task. The results obtained demonstrate the feasibility and validity of our method.
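The abstract names KL-divergence and several error measures but does not say how they are computed or applied. The following is a minimal Python sketch of the standard textbook definitions only, not the authors' framework. The function names `kl_divergence` and `error_metrics` are hypothetical, the variance shown is assumed to be the population variance of the prediction errors, and Theil statistics are left out because several definitions exist and the abstract does not say which is used.

```python
import math


def kl_divergence(p, q):
    """Discrete KL-divergence D(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    p and q are assumed to be probability vectors of equal length.
    Terms with p_i == 0 contribute 0; if q_i == 0 while p_i > 0 the
    divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def error_metrics(actual, predicted):
    """Standard error measures, with prediction error e_i = predicted_i - actual_i."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    n = len(errors)
    bias = sum(errors) / n                      # mean error
    mse = sum(e * e for e in errors) / n        # mean square error
    mae = sum(abs(e) for e in errors) / n       # mean absolute error
    return {
        "bias": bias,
        "mse": mse,
        "mae": mae,
        "rmse": math.sqrt(mse),                 # root mean square error
        "variance": mse - bias ** 2,            # population variance of the errors (assumption)
    }
```

A divergence of zero means two distributions are identical. One plausible use, which is an assumption here and not something the abstract states, is to flag two records as duplicates when the divergence between their value distributions falls below a chosen threshold.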
Keywords: Kullback-Leibler divergence; KL-divergence; data cleansing; data pruning; distance sum-based approach; DSA; data quality improvement; data warehouse; prediction error; bias; mean square error; variance; mean absolute error; root mean square error; RMSE; Theil statistics; data warehousing.
Date: 2017
Citations: View citations in EconPapers (1)
Downloads: (external link)
http://www.inderscience.com/link.php?id=83062 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:ids:injdan:v:9:y:2017:i:1:p:17-33
More articles in International Journal of Data Analysis Techniques and Strategies from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker.