Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines
Andrea Abate,
Elisa Poncato,
Maria Antonietta Barbieri,
Greg Powell,
Andrea Rossi,
Simay Peker,
Anders Hviid,
Andrew Bate and
Maurizio Sessa
Additional contact information
Andrea Abate: University of Copenhagen
Elisa Poncato: University of Copenhagen
Maria Antonietta Barbieri: University of Copenhagen
Greg Powell: GSK
Andrea Rossi: University of Milan
Simay Peker: University of Copenhagen
Anders Hviid: University of Copenhagen
Andrew Bate: GSK
Maurizio Sessa: University of Copenhagen
Drug Safety, 2025, vol. 48, issue 7, No 8, 805-820
Abstract:
Background: This study evaluated the feasibility of using ChatGPT and Gemini, two off-the-shelf large language models (LLMs), to automate causality assessments, focusing on Adverse Events Following Immunizations (AEFIs) of myocarditis and pericarditis related to COVID-19 vaccines.
Methods: We assessed 150 COVID-19-related cases of myocarditis and pericarditis reported to the Vaccine Adverse Event Reporting System (VAERS) in the United States of America (USA). Both LLMs and human experts applied the World Health Organization (WHO) algorithm for vaccine causality assessment, and inter-rater agreement was measured using percentage agreement. Adherence to the WHO algorithm was evaluated by comparing LLM responses to the expected sequence of the algorithm. Statistical analyses, including descriptive statistics and random forest modeling, explored case complexity (e.g., string length measurements) and factors affecting LLM performance and adherence.
Results: ChatGPT showed higher adherence to the WHO algorithm (34%) than Gemini (7%) and had moderate agreement (71%) with human experts, whereas Gemini had fair agreement (53%). Both LLMs often failed to recognize listed AEFIs, with ChatGPT and Gemini incorrectly identifying 6.7% and 13.3% of AEFIs, respectively. ChatGPT showed inconsistencies in 8.0% of cases and Gemini in 46.7%. For ChatGPT, adherence to the algorithm was associated with lower string complexity in prompt sections. The random forest analysis achieved an accuracy of 55% (95% confidence interval 35.7–73.5) for predicting ChatGPT's adherence to the WHO algorithm.
Conclusion: Notable limitations were identified in the use of ChatGPT and Gemini to aid causality assessments in vaccine safety. ChatGPT performed better, with higher adherence and agreement with human experts. In the investigated scenario, both models are better suited as complementary tools to human expertise.
Date: 2025
Full text: http://link.springer.com/10.1007/s40264-025-01531-y (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:spr:drugsa:v:48:y:2025:i:7:d:10.1007_s40264-025-01531-y
Ordering information: This journal article can be ordered from
http://www.springer.com/adis/journal/40264
DOI: 10.1007/s40264-025-01531-y
Drug Safety is currently edited by Nitin Joshi
Bibliographic data for this series is maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.