Comparison of local large language models for extraction of signs and symptoms data from electronic health records

Spiero, Isa; Rijk, Merijn H; Scheeres, Matthew A; Rutten, Frans H; Geersing, Geert-Jan; Platteel, Tamara N; Moons, Karel GM; Hooft, Lotty; Damen, Johanna AA; Venekamp, Roderick P; Leeuwenberg, Artuur M

PLOS ONE, 2026, vol. 21, issue 6, 1-13

Abstract: Electronic health records (EHRs) provide a large source of data that can be used for research purposes. Extraction of information from unstructured clinical notes in EHRs can be automated by large language models (LLMs). Although LLMs are promising for this task, challenges remain in the reliable application of LLMs to EHRs, including the lack of development and validation for languages other than English. Here, we identified Dutch LLMs and compared their performance in a case study. We selected the MedRoBERTa.nl and RobBERT models based on local applicability, Dutch language compatibility, and model architecture. We evaluated their performance in a case study on the extraction of signs and symptoms from comprehensive Dutch primary care EHRs of patients with a lower respiratory tract infection. Using manually annotated clinical notes, models were trained as direct and prompt-based classifiers with varying amounts of training samples. Performance was expressed by precision, recall, and F1-score. The MedRoBERTa.nl and RobBERT models showed good performance as direct classifiers, with macro-averaged F1-scores of 0.74 (range 0.56–0.87) and 0.69 (range 0.46–0.86), respectively, using 1600 training samples. The prompt-based classifiers performed worse, with F1-scores of 0.08 (range 0.02–0.30) and 0.08 (range 0.02–0.22), respectively. In general, performance of the models was negatively affected by class imbalance and missingness of signs and symptoms. A minimum of 800 annotated training samples was required to obtain sufficient performance. The selected LLMs showed good performance as direct classifiers in extracting signs and symptoms from Dutch primary care EHRs. However, prompt-based models require performance improvement by further prompt engineering, and caution is warranted with imbalanced or partially missing EHR data. MedRoBERTa.nl and RobBERT models, used as direct classifiers, can be considered for clinical research to extract information from clinical notes from Dutch primary care EHRs, potentially reducing manual annotation time and accelerating real-world research and evidence generation.
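
The abstract reports precision, recall, and macro-averaged F1-score. As a rough illustration of how such figures are typically computed, the sketch below derives per-label precision, recall, and F1 from binary labels and then averages F1 across labels with equal weight. It is not the authors' evaluation code. The label names (`cough`, `fever`), the toy data, the binary present/absent coding, and the helper names `precision_recall_f1` and `macro_f1` are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one binary label (1 = present, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a label is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(per_label: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    scores = [precision_recall_f1(y_true, y_pred)[2] for y_true, y_pred in per_label.values()]
    return sum(scores) / len(scores)


# Hypothetical toy annotations: (manual labels, model predictions) per sign or symptom.
toy = {
    "cough": ([1, 1, 0, 1], [1, 1, 0, 0]),
    "fever": ([0, 1, 0, 0], [0, 1, 1, 0]),
}

for name, (y_true, y_pred) in toy.items():
    p, r, f = precision_recall_f1(y_true, y_pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro-averaged F1: {macro_f1(toy):.2f}")
```

Because the macro average weights every label equally, a rarely documented sign or symptom with a low F1 pulls the overall score down as much as a common one, which is one way class imbalance can show up in a summary figure like those reported above.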

Date: 2026

Downloads:
https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0350625 (text/html)
https://journals.plos.org/plosone/article/file?id= ... 50625&type=printable (application/pdf)

Persistent link: https://EconPapers.repec.org/RePEc:plo:pone00:0350625

DOI: 10.1371/journal.pone.0350625

More articles in PLOS ONE from Public Library of Science