AI-ECG classification for Brugada syndrome: A study of machine learning techniques to optimise for limited datasets

Saleh, Keenan; Hadadi, Raaif; Liang, Yixiu; Wong, Hong; Sau, Arunashis; Howard, James; Brittain, Evan; Annis, Jeffrey; El-Harasis, Majd; Shun-Shin, Matthew; Mohal, Jagdeep; Naraen, Akriti; Samways, Jack; Artico, Jessica; Ware, James; Kanagaratnam, Prapa; Ng, Fu Siong; Zolgharni, Massoud; Bai, Wenjia; Varnava, Amanda; Whinnett, Zachary; Arnold, Ahran

PLOS Digital Health, 2026, vol. 5, issue 2, 1-20

Abstract: Deep neural networks can classify ECGs with high accuracy when training data is abundant. Rare conditions like Brugada syndrome, an inherited arrhythmia syndrome predisposing to sudden death, pose challenges because data scarcity hinders model training. We evaluated multiple machine learning (ML) approaches to optimise a Brugada ECG classification model using limited training data. The baseline model was trained on a dataset comprising 176 Brugada, 176 right bundle branch block (RBBB) and 352 normal ECGs from Zhongshan Hospital (Zhongshan-baseline dataset), framed as a binary classification task to distinguish Brugada from non-Brugada ECGs. A 25%-75% train-test split was used to exacerbate data scarcity. To enhance training, we incorporated three additional datasets: (i) a different, labelled ECG dataset from Zhongshan Hospital including normal and RBBB ECGs (Zhongshan-pretrain), (ii) an unlabelled ECG dataset from Hammersmith Hospital including Brugada and non-Brugada ECGs (Imperial), and (iii) an open-access labelled ECG dataset (PTB-XL). Three strategies were tested: (1) supervised pretraining, (2) self-supervised pretraining with data augmentation, and (3) oversampling using SMOTE (synthetic minority oversampling technique). Each model was evaluated on the unseen internal test set and an external Brugada mimic dataset. The models were re-trained using an 80%-20% train-test split as a secondary analysis. The baseline model achieved 92.2% accuracy, F1-score 0.837, and area under the Receiver Operating Characteristic curve (AUC) 0.962. Supervised pretraining significantly improved performance when training data was scarce, with the best model pretrained on the Zhongshan-pretrain dataset boosting accuracy (+3.2%), F1-score (+0.071) and AUC (+0.019), with consistent cross-validation performance. Self-supervised pretraining produced smaller and more variable gains, although select models better mitigated false positives on the Brugada mimic dataset. SMOTE oversampling showed inconsistent effects on performance. Incorporating pretraining and oversampling may facilitate the development of more accurate AI-ECG models for rare diseases when training data is limited but provides diminishing returns when adequate labelled data is available.

Author summary: AI applied to ECG interpretation (AI-ECG) is an emerging tool in the field of cardiac diagnostics that can rapidly automate ECG analysis, improving clinical resource utilisation and efficiency. However, rare conditions have limited available data to train AI-ECG models, hindering their performance. In this study, we developed a baseline AI-ECG model for Brugada syndrome using a severely restricted dataset and investigated three strategies to address data scarcity: supervised pretraining, self-supervised pretraining and oversampling. Pretraining involves training the model on a broader dataset before refining it for Brugada classification. This can be supervised, where ECG diagnoses or labels are known to the model, or self-supervised, where the model must learn patterns autonomously without labelled ECG data. Oversampling generates synthetic ECGs to supplement model training. Our results indicate that configurations of each approach provided incremental improvements in model performance and could be applied to the development of AI-ECG models for other rare cardiac diseases. Notably, we highlight that the strongest improvements are achieved using supervised pretraining, and we note the potential value of self-supervised pretraining using unlabelled datasets, which reduces reliance on resource-intensive manual labelling. Together, these findings show how data-efficient training strategies can support the development of AI-ECG models for rare cardiac diseases and help ensure that advances in AI-driven healthcare do not exacerbate existing health inequalities.
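Illustrative sketch (not taken from the article): the abstract names a 25%-75% train-test split and SMOTE oversampling but gives no implementation details. The Python below is a minimal, standard-library-only sketch of the generic SMOTE idea, which interpolates between a minority-class sample and one of its nearest minority-class neighbours. Several things here are assumptions rather than the authors' method: that the split is stratified by class, that SMOTE is applied to fixed-length feature vectors, the neighbour count k=5, and the function names stratified_train_counts and smote. The toy feature vectors are random numbers, not ECG data.

```python
import math
import random

# Class counts stated in the abstract (Zhongshan-baseline dataset).
N_BRUGADA, N_RBBB, N_NORMAL = 176, 176, 352
TRAIN_FRACTION = 0.25  # the 25%-75% train-test split


def stratified_train_counts(n_pos, n_neg, train_fraction):
    """Per-class training counts, ASSUMING the split is stratified by class."""
    return round(n_pos * train_fraction), round(n_neg * train_fraction)


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples (generic SMOTE, hypothetical helper).

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the sample itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Binary task: Brugada versus non-Brugada (RBBB and normal combined).
n_train_pos, n_train_neg = stratified_train_counts(
    N_BRUGADA, N_RBBB + N_NORMAL, TRAIN_FRACTION
)

# Toy stand-in for Brugada feature vectors (random values, NOT real ECG data).
toy_rng = random.Random(42)
toy_minority = [[toy_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(n_train_pos)]

# Generate enough synthetic minority samples to balance the two classes.
toy_synthetic = smote(toy_minority, n_train_neg - n_train_pos)
print(n_train_pos, n_train_neg, len(toy_synthetic))
```

Under the stratification assumption, the 25% training portion would hold 44 Brugada and 132 non-Brugada ECGs, so 88 synthetic minority samples would balance the classes. Oversampling of this kind is normally applied to the training portion only, so that the test set contains no synthetic samples.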

Date: 2026

Downloads: (external link)
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0001222 (text/html)
https://journals.plos.org/digitalhealth/article/fi ... 01222&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pdig00:0001222

DOI: 10.1371/journal.pdig.0001222
