
Active label cleaning for improved dataset quality under resource constraints

Mélanie Bernhardt, Daniel C. Castro, Ryutaro Tanno, Anton Schwaighofer, Kerem C. Tezcan, Miguel Monteiro, Shruthi Bannur, Matthew P. Lungren, Aditya Nori, Ben Glocker, Javier Alvarez-Valle and Ozan Oktay
Additional contact information
Mélanie Bernhardt: Health Intelligence, Microsoft Research Cambridge
Daniel C. Castro: Health Intelligence, Microsoft Research Cambridge
Ryutaro Tanno: Health Intelligence, Microsoft Research Cambridge
Anton Schwaighofer: Health Intelligence, Microsoft Research Cambridge
Kerem C. Tezcan: Health Intelligence, Microsoft Research Cambridge
Miguel Monteiro: Health Intelligence, Microsoft Research Cambridge
Shruthi Bannur: Health Intelligence, Microsoft Research Cambridge
Matthew P. Lungren: Stanford University
Aditya Nori: Health Intelligence, Microsoft Research Cambridge
Ben Glocker: Health Intelligence, Microsoft Research Cambridge
Javier Alvarez-Valle: Health Intelligence, Microsoft Research Cambridge
Ozan Oktay: Health Intelligence, Microsoft Research Cambridge

Nature Communications, 2022, vol. 13, issue 1, 1-11

Abstract: Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term “active label cleaning”. We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts’ valuable time for improving dataset quality.
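
The ranking idea in the abstract can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical example and not the paper's exact scoring rule: it assumes some trained classifier provides class posteriors for each sample, scores label correctness as the posterior mass on the current label, uses predictive entropy as a rough proxy for labelling difficulty, and surfaces likely mislabelled, easy-to-fix samples first.

import numpy as np

def rank_for_relabelling(posteriors: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return sample indices ordered for re-annotation (illustrative only).

    posteriors: (n_samples, n_classes) predicted class probabilities from any
                trained model (a stand-in for the paper's label-noise scorers).
    labels:     (n_samples,) current, possibly noisy, integer labels.
    """
    n = len(labels)
    # Estimated label correctness: posterior mass on the current label
    # (low mass suggests the label disagrees with the model and may be noisy).
    label_confidence = posteriors[np.arange(n), labels]
    # Labelling difficulty proxy: predictive entropy (high entropy marks
    # ambiguous samples that would cost annotators more effort to resolve).
    difficulty = -np.sum(posteriors * np.log(posteriors + 1e-12), axis=1)
    # Priority: likely-wrong labels first, ambiguous samples deprioritised.
    score = (1.0 - label_confidence) - difficulty
    return np.argsort(-score)

# Toy usage: 4 samples, 3 classes; samples 0 and 3 carry labels the model rejects.
posteriors = np.array([[0.05, 0.90, 0.05],
                       [0.80, 0.10, 0.10],
                       [0.34, 0.33, 0.33],
                       [0.10, 0.10, 0.80]])
noisy_labels = np.array([0, 0, 0, 1])
print(rank_for_relabelling(posteriors, noisy_labels))  # -> [0 3 2 1]

Under this toy scoring, confidently contradicted labels are queued for review ahead of ambiguous samples, which mirrors the abstract's goal of spending annotator time where corrections are most likely to be productive.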

Date: 2022

Downloads: (external link)
https://www.nature.com/articles/s41467-022-28818-3 Abstract (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-28818-3

Ordering information: This journal article can be ordered from
https://www.nature.com/ncomms/

DOI: 10.1038/s41467-022-28818-3

Nature Communications is currently edited by Nathalie Le Bot, Enda Bergin and Fiona Gillespie

More articles in Nature Communications from Nature
Bibliographic data for series maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.

 
Handle: RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-28818-3