Comparative Analysis of Resampling Techniques for Class Imbalance in Financial Distress Prediction Using XGBoost

Guodong Hou, Dong Ling Tong, Soung Yue Liew and Peng Yin Choo
Additional contact information
Guodong Hou: Faculty of Information and Communication Technology, University Tunku Abdul Rahman, Kampar Campus, Kampar 31900, Perak, Malaysia
Dong Ling Tong: Faculty of Information and Communication Technology, University Tunku Abdul Rahman, Kampar Campus, Kampar 31900, Perak, Malaysia
Soung Yue Liew: Faculty of Information and Communication Technology, University Tunku Abdul Rahman, Kampar Campus, Kampar 31900, Perak, Malaysia
Peng Yin Choo: Faculty of Information and Communication Technology, University Tunku Abdul Rahman, Kampar Campus, Kampar 31900, Perak, Malaysia

Mathematics, 2025, vol. 13, issue 13, 1-21

Abstract: A key challenge in financial distress data is class imbalance: distressed samples are far outnumbered by non-distressed ones. This study examines eight resampling techniques for improving distress prediction using the XGBoost algorithm. The study was performed on a dataset acquired from the CSMAR database, containing 26,383 firm-quarter samples from 639 Chinese A-share listed companies (2007–2024), with only 12.1% of the cases being distressed. Results show that standard Synthetic Minority Oversampling Technique (SMOTE) enhanced F1-score (up to 0.73) and Matthews Correlation Coefficient (MCC, up to 0.70), while SMOTE-Tomek and Borderline-SMOTE further boosted recall at a slight cost in precision. These oversampling and hybrid methods also maintained reasonable computational efficiency. Random Undersampling (RUS) was the fastest method and yielded high recall (0.85), but it suffered from low precision (0.46) and weaker generalization. Among all techniques, Bagging-SMOTE achieved balanced performance (AUC 0.96, F1 0.72, PR-AUC 0.80, MCC 0.68) using a minority-to-majority ratio of 0.15, demonstrating that ensemble-based resampling can improve robustness with minimal impact on the original class distribution, albeit with higher computational cost. The comparative findings highlight that no single approach fits all use cases, and technique selection should align with specific goals. Techniques favoring recall (e.g., Bagging-SMOTE, SMOTE-Tomek) are suited for early warning, while conservative techniques (e.g., Tomek Links) help reduce false positives in risk-sensitive applications, and efficient methods such as RUS are preferable when computational speed is a priority.
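
The figures quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' code: the helper names (`f1_score`, `mcc`, `synthetic_needed`) are hypothetical, and the class counts are approximations derived from the rounded 12.1% share. It shows why a minority-to-majority ratio of 0.15 barely changes the original class distribution, and what F1 the reported RUS precision and recall imply.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def synthetic_needed(n_minority: int, n_majority: int, target_ratio: float) -> int:
    """Minority samples to add so that minority/majority reaches target_ratio."""
    target_minority = math.ceil(target_ratio * n_majority)
    return max(0, target_minority - n_minority)


# Figures taken from the abstract; the 12.1% share is rounded,
# so the derived counts are approximate (assumption).
N_SAMPLES = 26_383
DISTRESSED_SHARE = 0.121

n_minority = round(N_SAMPLES * DISTRESSED_SHARE)
n_majority = N_SAMPLES - n_minority

# Ratio before any resampling, roughly 0.138.
original_ratio = n_minority / n_majority

# Extra distressed samples needed to reach the 0.15 ratio
# reported for Bagging-SMOTE.
extra_minority = synthetic_needed(n_minority, n_majority, 0.15)

# F1 implied by the reported RUS precision (0.46) and recall (0.85).
rus_f1 = f1_score(0.46, 0.85)

print(f"approx. class counts: {n_minority} distressed, {n_majority} non-distressed")
print(f"original minority-to-majority ratio: {original_ratio:.3f}")
print(f"extra minority samples for ratio 0.15: {extra_minority}")
print(f"F1 implied by RUS precision/recall: {rus_f1:.2f}")
```

Under these assumptions, moving from a ratio of about 0.138 to 0.15 requires only a few hundred additional minority samples, which is consistent with the abstract's description of a minimal impact on the class distribution. The implied RUS F1 of about 0.60 sits well below the 0.72 to 0.73 reported for the SMOTE-based methods.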

Keywords: class imbalance; SMOTE; ADASYN; Borderline-SMOTE; SMOTE-Tomek; SMOTE-ENN; RUS; Tomek Links; Bagging-SMOTE; XGBoost
JEL-codes: C
Date: 2025

Downloads:
https://www.mdpi.com/2227-7390/13/13/2186/pdf (application/pdf)
https://www.mdpi.com/2227-7390/13/13/2186/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:13:y:2025:i:13:p:2186-:d:1694825

Mathematics is currently edited by Ms. Emma He

Page updated 2025-07-05
Handle: RePEc:gam:jmathe:v:13:y:2025:i:13:p:2186-:d:1694825