The bootstrap procedure in classification problems
Borislava Petrova Vrigazova and Ivan Ganchev Ivanov
International Journal of Data Mining, Modelling and Management, 2020, vol. 12, issue 4, 428-446
Abstract:
In classification problems, cross-validation draws random samples from the dataset in order to improve the model's ability to correctly classify new observations into their respective classes. Research articles from various fields show that, when applied to regression problems, the bootstrap can improve either the predictive ability of the model or its ability to select features. The purpose of our research is to show that the bootstrap, used as a model selection procedure in classification problems, can outperform cross-validation. We compare the performance measures of cross-validation and the bootstrap on a set of classification problems and analyse their practical advantages and disadvantages. We show that the bootstrap procedure can reduce execution time compared to the cross-validation procedure while preserving the accuracy of the classification model. This advantage of the bootstrap is particularly important for large datasets, as the time needed to fit the model can be reduced without decreasing the model's performance.
Keywords: logistic regression; decision tree; k-nearest neighbour; KNN; the bootstrap; cross-validation.
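The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not specify their bootstrap scheme, so out-of-bag evaluation is assumed here, and the synthetic data, the small k-nearest-neighbour classifier, and all function names (`make_data`, `kfold_accuracy`, `bootstrap_accuracy`) are hypothetical choices made only for illustration. The sketch estimates classification accuracy once with 10-fold cross-validation and once with bootstrap resampling.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data: two Gaussian blobs in 2D (illustrative only)."""
    data = []
    for i in range(n):
        label = i % 2
        centre = 0.0 if label == 0 else 2.5
        point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        data.append((point, label))
    return data


def knn_predict(train, x, k=5):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2,
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k=5):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def kfold_accuracy(data, folds, rng, k=5):
    """Mean accuracy over `folds` disjoint held-out folds."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    scores = []
    for f in range(folds):
        test_idx = set(idx[f::folds])
        train = [data[i] for i in idx if i not in test_idx]
        test = [data[i] for i in test_idx]
        scores.append(accuracy(train, test, k))
    return sum(scores) / len(scores)


def bootstrap_accuracy(data, rounds, rng, k=5):
    """Mean out-of-bag accuracy: train on a resample drawn with replacement,
    evaluate on the observations that were never drawn (assumed scheme)."""
    n = len(data)
    scores = []
    for _ in range(rounds):
        drawn = [rng.randrange(n) for _ in range(n)]
        drawn_set = set(drawn)
        oob = [data[i] for i in range(n) if i not in drawn_set]
        if not oob:  # practically impossible for large n, kept for safety
            continue
        train = [data[i] for i in drawn]
        scores.append(accuracy(train, oob, k))
    return sum(scores) / len(scores)


rng = random.Random(0)  # fixed seed so the run is repeatable
data = make_data(200, rng)
cv_acc = kfold_accuracy(data, folds=10, rng=rng)
boot_acc = bootstrap_accuracy(data, rounds=10, rng=rng)
print(f"10-fold CV accuracy:    {cv_acc:.3f}")
print(f"bootstrap OOB accuracy: {boot_acc:.3f}")
```

On well-separated synthetic data such as this, both estimates should be close to each other. The sketch says nothing about the execution-time results reported in the paper, which depend on the datasets, models, and resampling settings the authors used.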
Date: 2020
Downloads: http://www.inderscience.com/link.php?id=111400 (text/html)
Access to full text is restricted to subscribers.
Persistent link: https://EconPapers.repec.org/RePEc:ids:ijdmmm:v:12:y:2020:i:4:p:428-446
Publisher: Inderscience Enterprises Ltd