Testing theory of mind in large language models and humans
James W. A. Strachan,
Dalila Albergo,
Giulia Borghini,
Oriana Pansardi,
Eugenio Scaliti,
Saurabh Gupta,
Krati Saxena,
Alessandro Rufo,
Stefano Panzeri,
Guido Manzi,
Michael S. A. Graziano and
Cristina Becchio
Additional contact information
James W. A. Strachan: University Medical Center Hamburg-Eppendorf
Dalila Albergo: Italian Institute of Technology
Giulia Borghini: Italian Institute of Technology
Oriana Pansardi: University Medical Center Hamburg-Eppendorf
Eugenio Scaliti: University Medical Center Hamburg-Eppendorf
Saurabh Gupta: Alien Technology Transfer Ltd
Krati Saxena: Alien Technology Transfer Ltd
Alessandro Rufo: Alien Technology Transfer Ltd
Stefano Panzeri: University Medical Center Hamburg-Eppendorf
Guido Manzi: Alien Technology Transfer Ltd
Michael S. A. Graziano: Princeton University
Cristina Becchio: University Medical Center Hamburg-Eppendorf
Nature Human Behaviour, 2024, vol. 8, issue 7, 1285-1295
Abstract:
At the core of what defines us as humans is the concept of theory of mind: the ability to track other people’s mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.
Date: 2024
Citations: 2
Downloads: https://www.nature.com/articles/s41562-024-01882-z (abstract, text/html)
Access to the full text of the articles in this series is restricted.
Persistent link: https://EconPapers.repec.org/RePEc:nat:nathum:v:8:y:2024:i:7:d:10.1038_s41562-024-01882-z
Ordering information: This journal article can be ordered from https://www.nature.com/nathumbehav/
DOI: 10.1038/s41562-024-01882-z
Nature Human Behaviour is currently edited by Stavroula Kousta
Bibliographic data for this series is maintained by Sonal Shukla and Springer Nature Abstracting and Indexing.