Machine Learning Pipeline for Early Diabetes Detection: A Comparative Study with Explainable AI
Yas Barzegar,
Atrin Barzegar,
Francesco Bellini,
Fabrizio D'Ascenzo,
Irina Gorelova and
Patrizio Pisani
Additional contact information
Yas Barzegar: Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy
Atrin Barzegar: Mathematics, Physics and Applications to Engineering Department, Università degli Studi della Campania “Luigi Vanvitelli”, Viale Lincoln n°5, 81100 Caserta, Italy
Francesco Bellini: Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy
Fabrizio D'Ascenzo: Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy
Irina Gorelova: Department of Management, Banking and Commodity Sciences, Sapienza University, 00161 Rome, Italy
Patrizio Pisani: Unidata S.p.A., Viale A. G. Eiffel, 00148 Roma, Italy
Future Internet, 2025, vol. 17, issue 11, 1-23
Abstract:
The use of Artificial Intelligence (AI) in healthcare has significantly advanced early disease detection, enabling timely diagnosis and improved patient outcomes. This work proposes an end-to-end, data-quality-focused machine learning (ML) pipeline for predicting diabetes. Its key steps are advanced preprocessing with k-nearest-neighbors (KNN) imputation, intelligent feature selection, class-imbalance handling with the hybrid SMOTEENN resampling method, and multi-model classification. We rigorously compared nine ML classifiers, including ensemble approaches (Random Forest, CatBoost, XGBoost), Support Vector Machines (SVM), and Logistic Regression (LR). We evaluated performance on specificity, accuracy, recall, precision, and F1-score to assess generalizability and robustness. We employed SHapley Additive exPlanations (SHAP) for explainability, using it to identify and rank the most influential clinical risk factors. SHAP analysis identified glucose level as the dominant predictor, followed by BMI and age, yielding clinically interpretable risk factors that align with established medical knowledge. Ensemble models outperformed the other classifiers; CatBoost performed best, achieving an ROC-AUC of 0.972, an accuracy of 0.968, and an F1-score of 0.971. The model was successfully validated on two larger datasets (CDC BRFSS and a 130-hospital dataset), confirming its generalizability. This data-driven design provides a reproducible platform for applying useful and interpretable ML models in clinical practice, with future Internet-of-Things-based smart healthcare systems as a primary application.
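The evaluation metrics named in the abstract (specificity, accuracy, recall, precision, and F1-score) all follow from the four confusion-matrix counts of a binary classifier. The following is a minimal sketch of those standard definitions, not the authors' code; the names `ConfusionCounts`, `count_outcomes`, and `classification_metrics` are hypothetical, and the labels in the usage example are toy values, not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts for a binary task (1 = diabetic, 0 = non-diabetic)."""
    tp: int  # diabetic, predicted diabetic
    fp: int  # non-diabetic, predicted diabetic
    tn: int  # non-diabetic, predicted non-diabetic
    fn: int  # diabetic, predicted non-diabetic


def count_outcomes(y_true, y_pred):
    """Tally the four outcome types from paired true and predicted 0/1 labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return ConfusionCounts(tp=tp, fp=fp, tn=tn, fn=fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(counts):
    """Compute the five metrics listed in the abstract from confusion counts."""
    total = counts.tp + counts.fp + counts.tn + counts.fn
    precision = _ratio(counts.tp, counts.tp + counts.fp)
    recall = _ratio(counts.tp, counts.tp + counts.fn)  # also called sensitivity
    return {
        "accuracy": _ratio(counts.tp + counts.tn, total),
        "precision": precision,
        "recall": recall,
        "specificity": _ratio(counts.tn, counts.tn + counts.fp),
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Toy labels for illustration only (assumption: not taken from the paper).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
    for name, value in classification_metrics(count_outcomes(y_true, y_pred)).items():
        print(f"{name}: {value:.3f}")
```

Reporting specificity alongside recall matters for an imbalanced screening task such as this one, because accuracy alone can look high when a model mostly predicts the majority class.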
Keywords: AI; smart healthcare; ML; diagnosis; hybrid resampling; interpretability; feature selection; future internet
JEL-codes: O3
Date: 2025
Downloads: (external link)
https://www.mdpi.com/1999-5903/17/11/513/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/11/513/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:11:p:513-:d:1791522
Future Internet is currently edited by Ms. Grace You
Bibliographic data for series maintained by MDPI Indexing Manager.