An Open-Source Clinical Case Dataset for Medical Image Classification and Multimodal AI Applications

Offidani, Mauro Nievas; Roffet, Facundo; Galtier, María Carolina González; Massiris, Miguel; Delrieux, Claudio

Mauro Nievas Offidani, Facundo Roffet, María Carolina González Galtier, Miguel Massiris and Claudio Delrieux
Additional contact information
Mauro Nievas Offidani: Electrical and Computer Engineering Department, Universidad Nacional del Sur, Bahía Blanca 8000, Argentina
Facundo Roffet: Electrical and Computer Engineering Department, Universidad Nacional del Sur, Bahía Blanca 8000, Argentina
María Carolina González Galtier: Independent Researcher, Bialet Massé 5158, Argentina
Miguel Massiris: Electrical and Computer Engineering Department, Universidad Nacional del Sur, Bahía Blanca 8000, Argentina
Claudio Delrieux: Electrical and Computer Engineering Department, Universidad Nacional del Sur, Bahía Blanca 8000, Argentina

Data, 2025, vol. 10, issue 8, 1-21

Abstract: The scarcity of high-quality, openly accessible clinical datasets remains a significant bottleneck in advancing both research and clinical applications within medical artificial intelligence. Case reports, often rich in multimodal clinical data, represent an underutilized resource for developing medical AI applications. We present an enhanced version of MultiCaRe, a dataset derived from open-access case reports on PubMed Central. This new version addresses the limitations identified in the previous release and incorporates newly added clinical cases and images (totaling 93,816 and 130,791, respectively), along with a refined hierarchical taxonomy featuring over 140 categories. Image labels have been meticulously curated using a combination of manual and machine learning-based label generation and validation, ensuring a higher quality for image classification tasks and the fine-tuning of multimodal models. To facilitate its use, we also provide a Python package for dataset manipulation, pretrained models for medical image classification, and two dedicated websites. The updated MultiCaRe dataset expands the resources available for multimodal AI research in medicine. Its scale, quality, and accessibility make it a valuable tool for developing medical AI systems, as well as for educational purposes in clinical and computational fields.

Keywords: artificial intelligence; data curation; dataset; healthcare; image classification; medical imaging; medicine; multimodality; image captioning
JEL-codes: C8 C80 C81 C82 C83
Date: 2025

Downloads:
https://www.mdpi.com/2306-5729/10/8/123/pdf (application/pdf)
https://www.mdpi.com/2306-5729/10/8/123/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:10:y:2025:i:8:p:123-:d:1714176

Data is currently edited by Ms. Becky Zhang

Bibliographic data for series maintained by MDPI Indexing Manager.