An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance
Samir K. Safi () and
Sheema Gul
Additional contact information
Samir K. Safi: Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates
Sheema Gul: Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates
Mathematics, 2024, vol. 12, issue 20, 1-17
Abstract:
Researchers using machine learning methods for classification can face challenges due to class imbalance, where a certain class is underrepresented. Over or under-sampling of minority or majority class observations, or solely relying on model selection for ensemble methods, may prove ineffective when the class imbalance ratio is extremely high. To address this issue, this paper proposes a method called enhance tree ensemble (ETE), based on generating synthetic data for minority class observations in conjunction with tree selection based on their performance on the training data. The proposed method first generates minority class instances to balance the training data and then uses the idea of tree selection by leveraging out-of-bag ( E T E O O B ) and sub-samples ( E T E S S ) observations, respectively. The efficacy of the proposed method is assessed using twenty benchmark problems for binary classification with moderate to extreme class imbalance, comparing it against other well-known methods such as optimal tree ensemble (OTE), SMOTE random forest ( R F S M O T E ), oversampling random forest ( R F O S ), under-sampling random forest ( R F U S ), k-nearest neighbor (k-NN), support vector machine (SVM), tree, and artificial neural network (ANN). Performance metrics such as classification error rate and precision are used for evaluation purposes. The analyses of the study revealed that the proposed method, based on data balancing and model selection, yielded better results than the other methods.
Keywords: random forest; tree selection; classification; class-imbalance problem; synthetic data generation (search for similar items in EconPapers)
JEL-codes: C (search for similar items in EconPapers)
Date: 2024
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
https://www.mdpi.com/2227-7390/12/20/3243/pdf (application/pdf)
https://www.mdpi.com/2227-7390/12/20/3243/ (text/html)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:gam:jmathe:v:12:y:2024:i:20:p:3243-:d:1500258
Access Statistics for this article
Mathematics is currently edited by Ms. Emma He
More articles in Mathematics from MDPI
Bibliographic data for series maintained by MDPI Indexing Manager ().