EconPapers    

Evaluating Reasoning in Large Language Models with a Modified Think-a-Number Game: Case Study

Petr Hoza

Acta Informatica Pragensia, 2025, vol. 2025, issue 2, 246-260

Abstract: Background: Large language models (LLMs) excel at a wide range of tasks but often struggle when extended reasoning requires maintaining a consistent internal state. Identifying the threshold at which these systems fail as task complexity increases is essential for reliable deployment. Objective: The primary objective was to examine whether four LLMs (GPT-3.5, GPT-4, GPT-4o-mini and GPT-4o) could preserve a hidden number and its arithmetic transformation across multiple yes/no queries, and to determine whether a specific point of reasoning breakdown exists. Methods: A modified "Think-a-Number" game was employed, with complexity defined by the number of sequential yes/no queries (ranging from 1 to 9 or 11). Seven prompting strategies, including chain-of-thought variants, counterfactual prompts and few-shot examples, were evaluated. An outcome was considered correct if the model's revealed number and transformation remained consistent with its prior answers. Results: Analysis of tens of thousands of trials showed no distinct performance cliff up to 9-11 queries, indicating that modern LLMs are more capable of consecutive reasoning than previously assumed. Counterfactual and certain chain-of-thought prompts outperformed simpler baselines. GPT-4o and GPT-4o-mini attained higher overall correctness, whereas GPT-3.5 and GPT-4 more often produced contradictory answers or premature disclosures. Conclusion: In a controlled, scalable reasoning scenario, these LLMs demonstrated notable resilience to multi-step prompts. Both prompt design and model selection significantly influenced performance. Further research involving more intricate tasks and higher query counts is recommended to delineate the upper bounds of LLM internal consistency.
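The correctness criterion in the abstract (the revealed number and transformation must reproduce every earlier yes/no answer) can be sketched as a small verifier. This is a hypothetical illustration, not the paper's actual evaluation code: it assumes a simple "add k" transformation and threshold-style queries ("is the transformed number greater than x?"), neither of which is specified in the abstract.

```python
# Hypothetical sketch of the consistency check described in the abstract.
# Assumptions (not from the paper): the hidden transformation is "add k",
# and each yes/no query asks whether the transformed number exceeds a threshold.

def transformed(number, k):
    """Apply the assumed arithmetic transformation (here: add k)."""
    return number + k

def ground_truth_answer(number, k, threshold):
    """Correct yes/no answer to 'is the transformed number > threshold?'."""
    return transformed(number, k) > threshold

def is_consistent(revealed_number, revealed_k, transcript):
    """A trial counts as correct if the revealed number and transformation
    reproduce every yes/no answer the model gave earlier in the game."""
    return all(
        ground_truth_answer(revealed_number, revealed_k, threshold) == given
        for threshold, given in transcript
    )

# Example: the model reveals number 7 with transformation +3 (transformed value 10)
# after answering three queries, recorded as (threshold, model's yes/no answer).
transcript = [(5, True), (12, False), (9, True)]
print(is_consistent(7, 3, transcript))  # True: 10 > 5, not > 12, 10 > 9
```

Scaling the length of `transcript` from 1 to 9-11 queries mirrors the complexity dimension the study varies; a "performance cliff" would appear as a sharp drop in the fraction of consistent trials at some transcript length.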

Keywords: LLM; Prompt engineering; AI; Artificial intelligence; Large language model; ChatGPT (search for similar items in EconPapers)
Date: 2025

Downloads: (external link)
http://aip.vse.cz/doi/10.18267/j.aip.273.html (text/html)
http://aip.vse.cz/doi/10.18267/j.aip.273.pdf (application/pdf)
free of charge

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:prg:jnlaip:v:2025:y:2025:i:2:id:273:p:246-260

Ordering information: This journal article can be ordered from
Redakce Acta Informatica Pragensia, Katedra systémové analýzy, Vysoká škola ekonomická v Praze, nám. W. Churchilla 4, 130 67 Praha 3
http://aip.vse.cz

DOI: 10.18267/j.aip.273


Acta Informatica Pragensia is currently edited by Editorial Office

More articles in Acta Informatica Pragensia from Prague University of Economics and Business Contact information at EDIRC.
Bibliographic data for series maintained by Stanislav Vojir.

 
Page updated 2025-07-27
Handle: RePEc:prg:jnlaip:v:2025:y:2025:i:2:id:273:p:246-260