Reliability-enhanced data cleaning in biomedical machine learning using inductive conformal prediction
Xianghao Zhan,
Qinmei Xu,
Yuanning Zheng,
Guangming Lu and
Olivier Gevaert
PLOS Computational Biology, 2025, vol. 21, issue 2, 1-27
Abstract:
Accurately labeling large datasets is important for biomedical machine learning yet challenging, and modern data augmentation methods may introduce noise into the training data, which can degrade machine learning model performance. Existing approaches to noisy training data typically rely on strict modeling assumptions, classification models and well-curated datasets. To address these limitations, we propose a novel reliability-based training-data-cleaning method employing inductive conformal prediction (ICP). This method uses a small set of well-curated training data and leverages ICP-calculated reliability metrics to selectively correct mislabeled data and outliers within vast quantities of noisy training data. The efficacy is validated across three classification tasks with distinct modalities: filtering drug-induced-liver-injury (DILI) literature with free-text title and abstract, predicting ICU admission of COVID-19 patients through CT radiomics and electronic health records, and subtyping breast cancer using RNA-sequencing data. Varying levels of noise were introduced into the training labels via label permutation. Our training-data-cleaning method significantly enhanced the downstream classification performance (paired t-tests, p ≤ 0.05 among 30 random train/test partitions): significant accuracy enhancement in 86 out of 96 DILI experiments (up to 11.4% increase from 0.812 to 0.905), significant AUROC and AUPRC enhancements in all 48 COVID-19 experiments (up to 23.8% increase from 0.597 to 0.739 for AUROC, and 69.8% increase from 0.183 to 0.311 for AUPRC), and significant accuracy and macro-average F1-score improvements in 47 out of 48 RNA-sequencing experiments (up to 74.6% increase from 0.351 to 0.613 for accuracy, and 89.0% increase from 0.267 to 0.505 for F1-score). The improvement can be both statistically and clinically significant for information retrieval, disease diagnosis and prognosis.
The method offers the potential to substantially boost classification performance in biomedical machine learning tasks without necessitating an excessive volume of well-curated training data or strong data distribution and modeling assumptions in existing semi-supervised learning methods.
Author summary:
In biomedical machine learning, noisy training data often compromise the performance of models critical for clinical decision-making. Generating well-curated datasets is challenging, while noisy datasets are prevalent, especially with advanced data augmentation techniques. This study introduces a novel reliability-based training data-cleaning method employing inductive conformal prediction (ICP). Using a small, well-curated calibration set, the method identifies and corrects mislabeled samples and removes outliers, enhancing label quality without strong assumptions on data distribution or model structure. We validated the approach across three diverse tasks: filtering drug-induced liver injury (DILI) literature, predicting ICU admissions of COVID-19 patients from radiomics and clinical data, and subtyping breast cancer based on RNA-seq profiles. Results showed significant improvements in classification performance, even under varying levels of label noise. This method demonstrates a practical solution for leveraging large, noisy datasets in biomedical applications, reducing reliance on extensive manual labeling, and improving the reliability of machine-learning models across modalities. Our findings highlight the potential of ICP to advance data-cleaning strategies in noisy real-world settings.
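Illustrative sketch (not the authors' implementation):
The general idea of ICP-based label cleaning can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the nearest-centroid nonconformity score, the function names icp_p_values, clean_labels and demo, the threshold values, and the synthetic two-class data. The sketch computes standard ICP p-values from a small curated calibration set, derives the usual conformal credibility (largest p-value) and confidence (one minus the second-largest p-value), drops low-credibility samples as outliers, and relabels samples whose confident prediction disagrees with the noisy label.

```python
import numpy as np


def icp_p_values(train_X, train_y, calib_X, calib_y, X, n_classes):
    """ICP p-values of every candidate label for each row of X."""
    # Underlying model (assumed for illustration): one centroid per class,
    # fitted on the proper training split only.
    centroids = np.stack(
        [train_X[train_y == c].mean(axis=0) for c in range(n_classes)]
    )

    def nonconformity(points, labels):
        # Distance to the centroid of the hypothesised label.
        return np.linalg.norm(points - centroids[labels], axis=1)

    calib_scores = nonconformity(calib_X, calib_y)
    p = np.empty((len(X), n_classes))
    for c in range(n_classes):
        scores = nonconformity(X, np.full(len(X), c))
        # Count calibration scores at least as large as each test score.
        n_ge = (calib_scores[None, :] >= scores[:, None]).sum(axis=1)
        p[:, c] = (n_ge + 1) / (len(calib_scores) + 1)
    return p


def clean_labels(p, noisy_y, min_confidence=0.95, min_credibility=0.05):
    """Hypothetical rule: drop low-credibility rows, relabel confident ones."""
    ordered = np.sort(p, axis=1)
    credibility = ordered[:, -1]       # largest p-value
    confidence = 1.0 - ordered[:, -2]  # 1 - second-largest p-value
    predicted = p.argmax(axis=1)
    keep = credibility >= min_credibility  # False means treated as an outlier
    relabel = keep & (confidence >= min_confidence) & (predicted != noisy_y)
    cleaned_y = np.where(relabel, predicted, noisy_y)
    return cleaned_y, keep


def demo(seed=0):
    """Synthetic two-class example with about 30% of noisy labels flipped."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [6.0, 6.0]])

    def sample(n):
        y = rng.integers(0, 2, size=n)
        return centers[y] + rng.normal(size=(n, 2)), y

    train_X, train_y = sample(100)  # curated data: proper training part
    calib_X, calib_y = sample(100)  # curated data: calibration part
    noisy_X, true_y = sample(300)   # larger set whose labels get corrupted
    flipped = rng.random(300) < 0.3
    noisy_y = np.where(flipped, 1 - true_y, true_y)

    p = icp_p_values(train_X, train_y, calib_X, calib_y, noisy_X, n_classes=2)
    cleaned_y, keep = clean_labels(p, noisy_y)
    acc_before = (noisy_y == true_y).mean()
    acc_after = (cleaned_y[keep] == true_y[keep]).mean()
    return acc_before, acc_after, keep.mean()
```

Calling demo() returns the label accuracy before cleaning, the label accuracy among retained samples after cleaning, and the fraction of samples retained. The p-values here are computed against the whole calibration set; a class-conditional (Mondrian) variant would compare each candidate label only with calibration samples of that class.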
Date: 2025
Downloads: (external link)
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1012803 (text/html)
https://journals.plos.org/ploscompbiol/article/fil ... 12803&type=printable (application/pdf)
Persistent link: https://EconPapers.repec.org/RePEc:plo:pcbi00:1012803
DOI: 10.1371/journal.pcbi.1012803