Putting AI agents through their paces on general tasks

Perez-Cruz, Fernando; Shin, Hyun Song

No 1245, BIS Working Papers from Bank for International Settlements

Abstract: Multimodal large language models (LLMs), trained on vast datasets, are becoming increasingly capable in many settings. However, the capabilities of such models are typically evaluated in narrow tasks, much like standard machine learning models trained for specific objectives. We take a different tack by putting the latest LLM agents through their paces in general tasks involved in solving three popular games: Wordle, Face Quiz and Flashback. These games are easily tackled by humans, but they demand a degree of self-awareness and higher-level abilities to experiment, to learn from mistakes and to plan accordingly. We find that the LLM agents display mixed performance in these general tasks. They lack the awareness to learn from mistakes and the capacity for self-correction. LLMs' performance in the most complex cognitive subtasks may not be the limiting factor for their deployment in real-world environments. Instead, it would be important to evaluate the capabilities of AGI-aspiring LLMs through general tests that encompass multiple cognitive tasks, enabling them to solve complete, real-world applications.

Keywords: AI Agents; LLMs evaluation
JEL-codes: C88
Date: 2025-02
New Economics Papers: this item is included in nep-cmp and nep-neu

Downloads: (external link)
https://www.bis.org/publ/work1245.pdf Full PDF document (application/pdf)
https://www.bis.org/publ/work1245.htm (text/html)

Persistent link: https://EconPapers.repec.org/RePEc:bis:biswps:1245