Using imputation algorithms when missing values appear in the test data in contrast with the training data
Narges Sadat Bathaeian
International Journal of Data Analysis Techniques and Strategies, 2018, vol. 10, issue 2, 111-123
Abstract:
Real datasets suffer from the problem of missing data. Imputation is a common solution for this problem. Most of research works perform imputation algorithms to training data. Therefore, the output variable of samples might influence the imputation model. This paper aims to compare different imputation algorithms when they are applied to test data and training data. In this research, first, the relations between output variable and different imputation algorithms are investigated. Then six different types of imputation algorithms are applied to both training data and test data. Chosen datasets are globally available, and cover both classification and regression tasks. Also missing values are injected artificially to them. The results showed that performance of all algorithms will reduce in the case of elimination of output variable. Particularly, decline in algorithm, which uses k nearest neighbour for imputation in the classification datasets is not ignorable. Nevertheless, algorithms that are based on random forests have less decline and show better results compared with other five types of algorithms.
Keywords: missing values; imputation algorithms; regression; kNN; MICE; random forest; tree; EM. (search for similar items in EconPapers)
Date: 2018
References: Add references at CitEc
Citations:
Downloads: (external link)
http://www.inderscience.com/link.php?id=92447 (text/html)
Access to full text is restricted to subscribers.
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:ids:injdan:v:10:y:2018:i:2:p:111-123
Access Statistics for this article
More articles in International Journal of Data Analysis Techniques and Strategies from Inderscience Enterprises Ltd
Bibliographic data for series maintained by Sarah Parker ().