Putting AI agents through their paces on general tasks
Fernando Perez-Cruz and Hyun Song Shin
No 1245, BIS Working Papers from Bank for International Settlements
Abstract:
Multimodal large language models (LLMs), trained on vast datasets, are becoming increasingly capable in many settings. However, the capabilities of such models are typically evaluated in narrow tasks, much like standard machine learning models trained for specific objectives. We take a different tack by putting the latest LLM agents through their paces in general tasks involved in solving three popular games: Wordle, Face Quiz and Flashback. These games are easily tackled by humans, but they demand a degree of self-awareness and higher-level abilities to experiment, to learn from mistakes and to plan accordingly. We find that the LLM agents display mixed performance in these general tasks. They lack the awareness to learn from mistakes and the capacity for self-correction. LLMs' performance in the most complex cognitive subtasks may not be the limiting factor for their deployment in real-world environments. Instead, it would be important to evaluate the capabilities of AGI-aspiring LLMs through general tests that encompass multiple cognitive tasks, enabling them to solve complete, real-world applications.
Keywords: AI Agents; LLMs evaluation
JEL-codes: C88
Date: 2025-02
New Economics Papers: this item is included in nep-cmp and nep-neu
Downloads: (external link)
https://www.bis.org/publ/work1245.pdf Full PDF document (application/pdf)
https://www.bis.org/publ/work1245.htm (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:bis:biswps:1245
Bibliographic data for series maintained by Martin Fessler.