Benchmarking Large Language Models from Open and Closed Source Models to Apply Data Annotation for Free-Text Criteria in Healthcare

Ali Nemati, Mohammad Assadi Shalmani, Qiang Lu and Jake Luo
Additional contact information
Ali Nemati: Health Informatics Department, Zilber College of Public Health, University of Wisconsin, Milwaukee, WI 53211, USA
Mohammad Assadi Shalmani: Health Informatics Department, Zilber College of Public Health, University of Wisconsin, Milwaukee, WI 53211, USA
Qiang Lu: Beijing Key Laboratory of Petroleum Data Mining, China University of Petroleum, Beijing 102249, China
Jake Luo: Health Informatics & Administration Department, Zilber College of Public Health, University of Wisconsin, Milwaukee, WI 53211, USA

Future Internet, 2025, vol. 17, issue 4, 1-27

Abstract: Large language models (LLMs) hold the potential to significantly enhance data annotation for free-text healthcare records. However, ensuring their accuracy and reliability is critical, especially in clinical research applications requiring the extraction of patient characteristics. This study introduces a novel evaluation framework based on Multi-Criteria Decision Analysis (MCDA) and the Order of Preference by Similarity to Ideal Solution (TOPSIS) technique, designed to benchmark LLMs on their annotation quality. The framework defines ten evaluation metrics across key criteria such as age, gender, BMI, disease presence, and blood markers (e.g., white blood count and platelets). Using this methodology, we assessed leading open source and commercial LLMs, achieving accuracy scores of 0.59, 1, 0.84, 0.56, and 0.92, respectively, for the specified criteria. Our work not only provides a rigorous framework for evaluating LLM capabilities in healthcare data annotation but also highlights their current performance limitations and strengths. By offering a comprehensive benchmarking approach, we aim to support responsible adoption and decision-making in healthcare applications.

Keywords: large language models; healthcare data annotation; multi-criteria decision analysis; closed source and open source models; evaluation metrics; human and LLM evaluation; decision-making in healthcare
JEL-codes: O3
Date: 2025
Downloads:
https://www.mdpi.com/1999-5903/17/4/138/pdf (application/pdf)
https://www.mdpi.com/1999-5903/17/4/138/ (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:gam:jftint:v:17:y:2025:i:4:p:138-:d:1618786

Future Internet is currently edited by Ms. Grace You

Bibliographic data for series maintained by MDPI Indexing Manager.

 
Page updated 2025-04-05
Handle: RePEc:gam:jftint:v:17:y:2025:i:4:p:138-:d:1618786