Multicenter comparative analysis of local and aggregated data training strategies in COVID-19 outcome prediction with Machine learning

Carine Savalli, Roberta Wichmann, Fabiano Barcellos Filho, Fernando Timoteo Fernandes, Alexandre Dias Porto Chiavegatto Filho, on behalf of IACOV-BR Network

PLOS Digital Health, 2024, vol. 3, issue 12, 1-13

Abstract: Machine learning (ML) is a promising tool in assisting clinical decision-making for improving diagnosis and prognosis, especially in developing regions. It is often used with large samples, aggregating data from different regions and hospitals. However, it is unclear how this affects predictions in local centers. This study aims to compare data aggregation strategies of several hospitals in Brazil with a local training strategy in each hospital to predict two COVID-19 outcomes: Intensive Care Unit admission (ICU) and mechanical ventilation use (MV). The study included 6,046 patients from 14 hospitals, with local sample sizes ranging from 47 to 1,500 patients. Machine learning models were trained using extreme gradient boosting, LightGBM, and CatBoost for structured data. Seven data aggregation strategies based on hospital geographic regions were compared with local training, and the best strategy was determined by analyzing the area under the ROC curve (AUROC). SHAP (Shapley Additive exPlanations) values were used to assess the contribution of variables to predictions. Additionally, a metafeatures analysis examined how hospital characteristics influence the selection of the best strategy. The study found that the local training strategy was the most effective approach, in the case of ICU outcomes, for 11 of the 14 hospitals (79%), and, in the case of MV, for 10 hospitals (71%). Metafeatures analysis suggested that hospitals with smaller sample sizes generally performed better using an aggregated data strategy compared to local training. Our study brings to light an important concern about the impact of grouping data from different hospitals in predictive machine learning models. These findings contribute to the ongoing debate about the trade-off between increasing sample size and bringing together heterogeneous scenarios.

Author summary: Machine learning (ML) in healthcare is often applied to large datasets, and a common strategy is to combine data from different regions and hospitals to increase sample sizes. In this study, we used ML models to predict two COVID-19-related outcomes: Intensive Care Unit admission (ICU) and mechanical ventilation (MV) use. We proposed different groupings of hospitals based on geographic regions and compared these with results obtained from individual hospitals (referred to as local training). The study found that local training generally provided more accurate predictions for the two COVID-19 outcomes. However, grouping hospitals for training prediction models was beneficial in cases where individual hospitals had few patients. We concluded that it is crucial to consider the local context before combining data from different centers with high data diversity.
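
The abstract states that the best training strategy for each hospital was chosen by comparing AUROC. The sketch below illustrates that selection step only, and it is not the authors' code: the function names `auroc` and `best_strategy`, the strategy labels, and the toy predictions are all hypothetical. AUROC is computed here from its pairwise definition (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case), which is an assumption about the calculation rather than something the paper specifies.

```python
from typing import Dict, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via pairwise comparison of positive and negative cases.

    Each (positive, negative) pair counts 1 if the positive case has the
    higher score, 0.5 on a tie, and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def best_strategy(
    labels: Sequence[int],
    scores_by_strategy: Dict[str, Sequence[float]],
) -> Tuple[str, Dict[str, float]]:
    """Return the strategy with the highest AUROC on one hospital's test set."""
    aucs = {name: auroc(labels, s) for name, s in scores_by_strategy.items()}
    return max(aucs, key=aucs.get), aucs


# Hypothetical test-set outcomes for one hospital (1 = ICU admission).
y_test = [1, 0, 1, 0, 1, 0]

# Hypothetical predicted probabilities from models trained under two strategies.
predicted = {
    "local": [0.9, 0.2, 0.8, 0.4, 0.7, 0.3],
    "aggregated_region": [0.6, 0.5, 0.4, 0.3, 0.9, 0.7],
}

winner, aucs = best_strategy(y_test, predicted)
print(winner, aucs)
```

In a setting like the one described, this comparison would be repeated for each of the 14 hospitals and for each outcome (ICU and MV), with the predictions coming from models trained on local data or on one of the aggregated groupings.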

Date: 2024

Downloads: (external link)
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000699 (text/html)
https://journals.plos.org/digitalhealth/article/fi ... 00699&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pdig00:0000699

DOI: 10.1371/journal.pdig.0000699
