Comparative evaluation of machine learning models for predicting Cimbex quadrimaculata population density across multiple problem formulations
Yunus Güral
PLOS ONE, 2026, vol. 21, issue 4, 1-1
Abstract:
The high variability of ecological datasets and the nonlinear relationships among environmental variables (such as temperature, relative humidity, and altitude) limit the predictive accuracy of classical statistical models. This study aimed to compare the performance of machine learning methods in analyzing complex ecological data structures. An agricultural dataset containing meteorological and vegetation variables was used as a representative case study. The dataset is based on population observations of Cimbex quadrimaculata in Diyarbakır (Eğil) and Elazığ (Keban) provinces in Türkiye between 2020 and 2022. Three modeling approaches (binary classification, multiclass classification, and regression) were applied to the same data. This design enabled a systematic comparison of model performance, generalizability, and explainability on one dataset under different definitions of the target variable. For the classification tasks, model performance was evaluated using accuracy, F1 score, and AUC under a stratified 10-fold cross-validation scheme. Regression models were assessed within a nested cross-validation framework using R², root mean square error (RMSE), and mean absolute error (MAE). Ensemble-based boosting algorithms (Gradient Boosting, XGBoost, and LightGBM) demonstrated high accuracy and generalizability in characterizing the highly nonlinear relationships, nested effects, and non-additive interactions among multiple variables. Furthermore, SHAP analysis improved the interpretability of the models and revealed that temperature- and humidity-related variables were consistently among the most influential predictors.
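The classification protocol named above (stratified 10-fold cross-validation scored with accuracy, F1, and AUC) can be sketched in plain Python. This is only an illustration under stated assumptions: the synthetic data, the midpoint-threshold "model", and all function names are hypothetical and are not taken from the study, which used boosting models on field observations.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin keeps class ratios similar
    return folds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic, imbalanced toy data (assumption): one feature, 30 positives, 70 negatives.
data_rng = random.Random(42)
y = [1] * 30 + [0] * 70
x = [data_rng.gauss(1.5 if label else 0.0, 1.0) for label in y]

folds = stratified_folds(y, k=10)
fold_metrics = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in held_out]
    # Stand-in model (assumption): threshold at the midpoint of the class means,
    # estimated from the training folds only.
    pos_vals = [x[i] for i in train_idx if y[i] == 1]
    neg_vals = [x[i] for i in train_idx if y[i] == 0]
    threshold = (sum(pos_vals) / len(pos_vals) + sum(neg_vals) / len(neg_vals)) / 2
    y_test = [y[i] for i in test_idx]
    scores = [x[i] for i in test_idx]
    preds = [int(s > threshold) for s in scores]
    fold_metrics.append((accuracy(y_test, preds), f1_score(y_test, preds), auc(y_test, scores)))

mean_acc, mean_f1, mean_auc = (sum(col) / len(col) for col in zip(*fold_metrics))
print(f"accuracy={mean_acc:.3f}  F1={mean_f1:.3f}  AUC={mean_auc:.3f}")
```

Stratifying matters here because each fold then contains both classes in roughly the original proportion, so F1 and AUC are defined on every held-out fold.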
Comparative performance evaluations showed that Gradient Boosting (94.3% accuracy, 0.983 AUC) and XGBoost (84.6% accuracy) were the strongest models in the binary classification scenarios and in the overall analyses, respectively. In the regression analyses, LightGBM and Random Forest stood out with cross-validation performances of R² ≈ 0.73. The ability of ensemble-based learning methods to capture multidimensional relationships in ecological datasets explains their high predictive accuracy and robustness across complex ecological data structures.
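The nested cross-validation used for the regression models can be illustrated with a small sketch: an inner loop selects a hyperparameter on the training portion of each outer fold, and the outer loop reports R², RMSE, and MAE on data never used for tuning. Everything below is an assumption made for illustration (a 1-D k-nearest-neighbour regressor, synthetic sine data, the fold counts, and the function names); it does not reproduce the study's models or results.

```python
import math
import random


def k_folds(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x_new, k):
    """Mean target of the k training points nearest to x_new (1-D feature)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(t for _, t in nearest) / len(nearest)


def regression_metrics(y_true, y_pred):
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


def cv_rmse(data, k_neighbors, n_folds):
    """Inner loop: cross-validated RMSE of one candidate setting."""
    errors = []
    for fold in k_folds(len(data), n_folds, seed=1):
        held_out = set(fold)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        preds = [knn_predict(train, x, k_neighbors) for x, _ in test]
        errors.append(regression_metrics([t for _, t in test], preds)["rmse"])
    return sum(errors) / len(errors)


def nested_cv(data, candidates=(1, 3, 5, 9), outer=5, inner=3):
    """Outer loop: score on folds that played no part in tuning."""
    results = []
    for fold in k_folds(len(data), outer):
        held_out = set(fold)
        train = [d for i, d in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        # Tune on the outer-training portion only, so the outer test fold stays unseen.
        best_k = min(candidates, key=lambda k: cv_rmse(train, k, inner))
        preds = [knn_predict(train, x, best_k) for x, _ in test]
        results.append(regression_metrics([t for _, t in test], preds))
    return {m: sum(r[m] for r in results) / len(results) for m in results[0]}


# Synthetic toy data (assumption): noisy sine curve over one feature.
data_rng = random.Random(7)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
data = [(x, math.sin(x) + data_rng.gauss(0, 0.2)) for x in xs]

scores = nested_cv(data)
print({name: round(value, 3) for name, value in scores.items()})
```

Keeping the tuning step inside each outer fold is the point of the nesting: a score obtained this way is not inflated by having chosen the hyperparameter on the same data it is evaluated on.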
Date: 2026
Downloads: (external link)
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0346494 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 46494&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0346494
DOI: 10.1371/journal.pone.0346494