Preventing Dataset Shift During Cross-Validation: Is it worth it?

Tim Angelike and Martin Papenberg

No b5fus_v1, OSF Preprints from Center for Open Science

Abstract: Dataset shift occurs when the distribution of the samples used to train a model differs from the distribution of the samples used to test it, which can reduce predictive performance when the model's predictions are generalized. Dataset shift can occur with respect to 1) the criterion variable (prior probability shift), 2) the predictor variables (covariate shift), and 3) the relationship between criterion and predictors (concept shift). The present paper investigated the implications of avoiding dataset shift during k-fold cross-validation. To circumvent the various forms of dataset shift during cross-validation, a range of anticlustering algorithms was employed to ensure equal means, variances, and covariance structure across folds. A comparative analysis assessed the bias in validation error obtained through anticlustering-based cross-validation and standard cross-validation. The bias in validation error was computed using the true test error based on unseen test data as reference. Using linear regression models, analyses of simulated and empirical datasets showed that avoiding prior probability shift in conjunction with covariate shift (but not concept shift) can effectively reduce the bias in the prediction error during cross-validation. The advantage of anticlustering-based cross-validation over standard cross-validation diminished, however, as R² increased. The study underscores the merits of using anticlustering within k-fold cross-validation to mitigate the adverse effects of prior probability and covariate shift by equating means and variances in predictor and criterion variables across folds.
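The comparison described in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the paper's anticlustering algorithms. As a stand-in it uses a crude stratification (sort by the criterion, then deal neighbouring cases into different folds) that only roughly equates the criterion's distribution across folds and leaves the predictors unbalanced. All function names, the simulated data, and the parameter values are assumptions made for illustration. The bias is taken as the cross-validation error minus the error on a large unseen test set.

```python
import numpy as np

def random_folds(n, k, rng):
    """Standard k-fold assignment: shuffle (near-)equal-sized fold labels."""
    return rng.permutation(np.arange(n) % k)

def balanced_folds(y, k, rng):
    """Illustrative stand-in, NOT an anticlustering algorithm: sort cases by
    the criterion and deal each block of k neighbours into different folds."""
    order = np.argsort(y)
    labels = np.empty(len(y), dtype=int)
    for start in range(0, len(y), k):
        block = order[start:start + k]
        labels[block] = rng.permutation(k)[:len(block)]
    return labels

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def mse(beta, X, y):
    """Mean squared prediction error of a fitted model on (X, y)."""
    A = np.column_stack([np.ones(len(y)), X])
    return float(np.mean((y - A @ beta) ** 2))

def cv_mse(X, y, labels):
    """Mean held-out squared error across the folds given by `labels`."""
    errors = []
    for f in np.unique(labels):
        test = labels == f
        errors.append(mse(fit_ols(X[~test], y[~test]), X[test], y[test]))
    return float(np.mean(errors))

def fold_mean_spread(y, labels):
    """Range of per-fold criterion means; 0 means no shift in fold means."""
    means = [y[labels == f].mean() for f in np.unique(labels)]
    return float(max(means) - min(means))

# Simulated data (assumed values, not taken from the paper).
rng = np.random.default_rng(0)
n, k = 200, 5
coef = np.array([0.5, -0.3, 0.2])
X = rng.normal(size=(n, 3))
y = X @ coef + rng.normal(size=n)
X_test = rng.normal(size=(10_000, 3))
y_test = X_test @ coef + rng.normal(size=10_000)

rnd = random_folds(n, k, rng)
bal = balanced_folds(y, k, rng)

# Reference: error of the model fitted on all training data, on unseen data.
test_error = mse(fit_ols(X, y), X_test, y_test)
bias_random = cv_mse(X, y, rnd) - test_error
bias_balanced = cv_mse(X, y, bal) - test_error

print("spread of fold means (random):  ", fold_mean_spread(y, rnd))
print("spread of fold means (balanced):", fold_mean_spread(y, bal))
print("bias (random):  ", bias_random)
print("bias (balanced):", bias_balanced)
```

A single run such as this says little about bias, which is an average over many repetitions. Estimating it would mean wrapping the last section in a loop over repeated simulated datasets and averaging `bias_random` and `bias_balanced`.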

Date: 2025-04-08

Downloads: (external link)
https://osf.io/download/67f67105167e48cd67b2526f/

Persistent link: https://EconPapers.repec.org/RePEc:osf:osfxxx:b5fus_v1

DOI: 10.31219/osf.io/b5fus_v1

Page updated 2025-04-12
Handle: RePEc:osf:osfxxx:b5fus_v1