Assessing the generalisation of artificial intelligence across mammography manufacturers

Alistair J Hickman, Sandra Gomes, Lucy M Warren, Nadia AS Smith and Caroline Shenton-Taylor

PLOS Digital Health, 2025, vol. 4, issue 8, 1-12

Abstract: The aim of this study was to determine whether differences between manufacturers of mammogram images affect the performance of artificial intelligence tools for classifying breast density. Processed mammograms from 10,156 women were used to train and validate three deep learning algorithms using three retrospective datasets (Hologic, General Electric, and Mixed, the last containing equal numbers of Hologic, General Electric and Siemens images), and the algorithms were tested on four independent withheld test sets (Hologic, General Electric, Mixed and Siemens). The area under the receiver operating characteristic curve (AUC) was compared. Women aged 47-73 with normal breasts (routine recall - no cancer) and Volpara ground truth were selected from the OPTIMAM Mammography Image Database for the years 2012-2015. 95% confidence intervals are used for significance testing in the results, with a Bayesian Signed Rank test used to rank the overall performance of the models. The best single-test performance is seen when a model is trained and tested on images from a single manufacturer (Hologic train/test: 0.98 and General Electric train/test: 0.97); however, the same models performed significantly worse on images from any other manufacturer (General Electric AUCs: 0.68 & 0.63; Hologic AUCs: 0.56 & 0.90). The model trained on the mixed dataset exhibited the best overall performance. Better performance occurs when training and test sets contain the same manufacturer distributions, and better generalisation occurs when more manufacturers are included in training. Models in clinical use should be trained on data representing the different vendors of mammogram machines used across screening programs. This is clinically relevant, as models will be impacted by changes and upgrades to mammogram machines in screening centres.

Author summary: A number of manufacturers of mammogram machines are in use within the NHS Breast Screening Program. Naturally, some of these manufacturers use different technologies to acquire the mammograms. These mammograms are made readable by applying processing to the raw information from the X-ray detector, and this processing is known to vary both between and within manufacturers. The aim of this study was to assess whether these differences impact the performance of AI classification algorithms. We trained three binary classifiers on three different datasets, two from single manufacturers and one with an even mix of three manufacturers. Models trained on single-manufacturer data could not generalise their knowledge to manufacturers unseen in training. The model trained on three manufacturers was the best overall performer. In general, models must be trained on images from all manufacturers present in the desired clinical setting, as there are sufficient differences between manufacturers that AI algorithms cannot transfer their knowledge to a mammogram from an unseen manufacturer. Models must also be monitored and kept up to date to reflect any changes to mammogram machines within the clinical setting.
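The comparison described above, one AUC per test set with a 95% confidence interval, can be sketched as follows. This is a minimal illustration and not the authors' code. The record does not say how the confidence intervals were computed, so a percentile bootstrap is assumed here. The function names, the test-set names, and the toy labels and scores are all hypothetical.

```python
import random


def auc(labels, scores):
    """AUC for a binary classifier via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample is missing a class, so the AUC is undefined
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy (labels, scores) pairs for two hypothetical test sets; not the study's data.
toy_test_sets = {
    "same_manufacturer": ([0, 0, 0, 1, 1, 1], [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]),
    "unseen_manufacturer": ([0, 0, 0, 1, 1, 1], [0.1, 0.6, 0.8, 0.3, 0.7, 0.9]),
}

results = {
    name: (auc(y, s), bootstrap_ci(y, s))
    for name, (y, s) in toy_test_sets.items()
}
```

With per-image scores from a trained model in place of the toy values, non-overlapping intervals between two test sets would be one simple way to read a difference as significant, in the spirit of the interval-based testing the abstract mentions.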

Date: 2025

Downloads: (external link)
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000973 (text/html)
https://journals.plos.org/digitalhealth/article/fi ... 00973&type=printable (application/pdf)


Persistent link: https://EconPapers.repec.org/RePEc:plo:pdig00:0000973

DOI: 10.1371/journal.pdig.0000973

More articles in PLOS Digital Health from Public Library of Science

 
Page updated 2025-08-16
Handle: RePEc:plo:pdig00:0000973