GDSMOTE: A Novel Synthetic Oversampling Method for High-Dimensional Imbalanced Financial Data

Hu, Libin; Zhang, Yunfeng

GDSMOTE: A Novel Synthetic Oversampling Method for High-Dimensional Imbalanced Financial Data

Libin Hu and Yunfeng Zhang ()
Additional contact information
Libin Hu: School of Management Science and Engineering, Shandong University of Finance and Economics, Jinan 250014, China
Yunfeng Zhang: School of Computer Science and Technology, Shandong University of Finance and Economics, Jinan 250014, China

Mathematics, 2024, vol. 12, issue 24, 1-19

Abstract: Synthetic oversampling methods for dealing with imbalanced classification problems have been widely studied. However, the current synthetic oversampling methods still cannot perform well when facing high-dimensional imbalanced financial data. The failure of distance measurement in high-dimensional space, error accumulation caused by noise samples, and the reduction of recognition accuracy of majority samples caused by the distribution of synthetic samples are the main reasons that limit the performance of current methods. Taking these factors into consideration, a novel synthetic oversampling method is proposed, namely the gradient distribution-based synthetic minority oversampling technique (GDSMOTE). Firstly, the concept of gradient contribution was used to assign the minority-class samples to different gradient intervals instead of relying on the spatial distance. Secondly, the root sample selection strategy of GDSMOTE avoids the error accumulation caused by noise samples and a new concept of nearest neighbor was proposed to determine the auxiliary samples. Finally, a safety gradient distribution approximation strategy based on cosine similarity was designed to determine the number of samples to be synthesized in each safety gradient interval. Experiments on high-dimensional imbalanced financial datasets show that GDSMOTE can achieve a higher F1-Score and MCC metrics than baseline methods while achieving a higher recall score. This means that our method has the characteristics of improving the recognition accuracy of minority-class samples without sacrificing the recognition accuracy of majority-class samples and has good adaptability to data decision-making tasks in the financial field.

Keywords: synthetic oversampling; high-dimensional imbalanced financial data; gradient distribution; gradient right nearest neighbor; safety gradient distribution approximation (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View complete reference list from CitEc
Citations:

Downloads: (external link)
https://www.mdpi.com/2227-7390/12/24/4036/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/24/4036/ (text/html)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.

Export reference: BibTeX RIS (EndNote, ProCite, RefMan) HTML/Text

Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:24:p:4036-:d:1550648

Access Statistics for this article

Mathematics is currently edited by Ms. Emma He

More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().