Data Quality Improvement for Financial Distress Prediction: Feature Selection, Data Re‐Sampling, and Their Combinations in Different Orders

Tsai, Chih‐Fong; Lin, Wei‐Chao; Chen, Yi‐Hsien

Journal of Forecasting, 2025, vol. 44, issue 7, 2205-2229

Abstract: In financial distress prediction (FDP), it is very important to ensure the quality of the data used to develop effective prediction models. Related studies often apply feature selection to filter out unrepresentative features from a set of financial ratios, or data re‐sampling to re‐balance class‐imbalanced FDP training sets. Although these two types of data pre‐processing methods have demonstrated their effectiveness, they have not often been applied at the same time to develop FDP models. Moreover, the performances of various feature selection algorithms, which can be divided into filter, wrapper, and embedded methods, and data re‐sampling algorithms, which can be divided into under‐sampling, over‐sampling, and hybrid sampling methods, have not been fully investigated in FDP. Therefore, this study compares several feature selection and data re‐sampling methods, employed alone and in combination in different orders. The experimental results based on nine FDP datasets show that executing data re‐sampling alone always outperforms executing feature selection alone when developing FDP models, with hybrid sampling being the better choice. In most cases, better prediction performance is obtained by performing feature selection first and data re‐sampling second. The best combination uses the decision tree method for feature selection and Synthetic Minority Over‐sampling Technique‐Edited Nearest Neighbors (SMOTE‐ENN) for hybrid sampling. This combination allows the random forest classifier to produce the highest prediction accuracy. On the other hand, for the Type I error, where crisis cases are misclassified into the non‐crisis class, the lowest error rate is produced by executing under‐sampling alone using the ClusterCentroids algorithm combined with the random forest classifier.
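
The abstract reports two evaluation measures: prediction accuracy and the Type I error, defined there as crisis cases misclassified into the non‐crisis class. The following is a minimal sketch of how these two measures could be computed from true and predicted labels. The label encoding (1 for crisis, 0 for non‐crisis), the function names, and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence

# Assumed label encoding (not specified in the abstract).
CRISIS = 1
NON_CRISIS = 0


def type_i_error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true crisis cases that were predicted as non-crisis."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Keep only the predictions made for firms that are truly in crisis.
    crisis_preds = [p for t, p in zip(y_true, y_pred) if t == CRISIS]
    if not crisis_preds:
        raise ValueError("y_true contains no crisis cases")
    missed = sum(1 for p in crisis_preds if p == NON_CRISIS)
    return missed / len(crisis_preds)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of all cases whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels: 4 crisis firms followed by 6 non-crisis firms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1]

print(type_i_error_rate(y_true, y_pred))  # 2 of 4 crisis firms missed
print(accuracy(y_true, y_pred))           # 7 of 10 labels correct
```

On class‐imbalanced data the two measures can diverge: a model may score high accuracy by favouring the majority non‐crisis class while still missing many crisis cases, which is one reason the abstract reports a different best configuration for each measure.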

Date: 2025

Downloads: (external link)
https://doi.org/10.1002/for.70002

Persistent link: https://EconPapers.repec.org/RePEc:wly:jforec:v:44:y:2025:i:7:p:2205-2229

Journal of Forecasting is currently edited by Derek W. Bunn

More articles in Journal of Forecasting from John Wiley & Sons, Ltd.