Acta Informatica Pragensia 2025, 14(2), 246-260 | DOI: 10.18267/j.aip.273

Evaluating Reasoning in Large Language Models with a Modified Think-a-Number Game: Case Study

Petr Hoza
Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, Prague University of Economics and Business, Prague, Czech Republic

Background: Large language models (LLMs) excel at various tasks but often encounter difficulties when extended reasoning requires maintaining a consistent internal state. Identifying the threshold at which these systems fail under increasing task complexity is essential for reliable deployment.

Objective: The primary objective was to examine whether four LLMs (GPT-3.5, GPT-4, GPT-4o-mini and GPT-4o) could preserve a hidden number and its arithmetic transformation across multiple yes/no queries and to determine whether a specific point of reasoning breakdown exists.

Methods: A modified “Think a Number” game was employed, with complexity defined by the number of sequential yes/no queries (ranging from 1 up to 9 or 11). Seven prompting strategies, including chain-of-thought variants, counterfactual prompts and few-shot examples, were evaluated. An outcome was considered correct if the model's revealed number and transformation remained consistent with all of its prior answers.
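The correctness criterion above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual harness: the predicates, function names and example values are hypothetical, assuming each yes/no query can be modelled as a predicate over the hidden number and its transformation.

```python
# Hypothetical sketch of the consistency check described in the Methods:
# the model commits to a hidden number n and a transformed value f(n),
# answers a sequence of yes/no queries, then reveals both. The trial
# counts as correct only if the reveal agrees with every earlier answer.

def consistent(revealed_n, revealed_fn, queries, answers):
    """queries[i] is a predicate over (n, f(n)); answers[i] is the
    model's yes/no reply to it. Returns True iff the revealed pair
    satisfies every answered query."""
    return all(pred(revealed_n, revealed_fn) == ans
               for pred, ans in zip(queries, answers))

# Example trial with three queries (illustrative, not the paper's set):
queries = [
    lambda n, fn: n > 10,        # "Is your number greater than 10?"
    lambda n, fn: n % 2 == 0,    # "Is it even?"
    lambda n, fn: fn == n + 5,   # "Is the transformation adding 5?"
]
answers = [True, True, True]     # the model's replies

print(consistent(14, 19, queries, answers))  # reveal matches all answers -> True
print(consistent(14, 28, queries, answers))  # contradicts the third answer -> False
```

Under this framing, raising task complexity simply means lengthening the query list, which is how the study scales difficulty from 1 up to 9 or 11 queries.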

Results: Analysis of tens of thousands of trials showed no distinct performance cliff up to 9–11 queries, indicating that modern LLMs are more capable of consecutive reasoning than previously assumed. Counterfactual and certain chain-of-thought prompts outperformed simpler baselines. GPT-4o and GPT-4o-mini attained higher overall correctness, whereas GPT-3.5 and GPT-4 more often displayed contradictory or premature disclosures.

Conclusion: In a controlled, scalable reasoning scenario, these LLMs demonstrated notable resilience to multi-step prompts. Both prompt design and model selection significantly influenced performance. Further research involving more intricate tasks and higher query counts is recommended to delineate the upper boundaries of LLM internal consistency.

Keywords: LLM; Prompt engineering; AI; Artificial intelligence; Large language model; ChatGPT.

Received: February 1, 2025; Revised: April 15, 2025; Accepted: June 3, 2025; Prepublished online: June 29, 2025; Published: July 26, 2025

Hoza, P. (2025). Evaluating Reasoning in Large Language Models with a Modified Think-a-Number Game: Case Study. Acta Informatica Pragensia, 14(2), 246-260. doi: 10.18267/j.aip.273

References

  1. Baddeley, A. (2012). Working Memory: Theories, Models, and Controversies. Annual Review of Psychology, 63, 1-29. https://doi.org/10.1146/annurev-psych-120710-100422
  2. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., Arx, S. von, Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Davis, J. Q., Demszky, D., … Liang, P. (2022). On the Opportunities and Risks of Foundation Models (No. arXiv:2108.07258). arXiv Preprint. https://doi.org/10.48550/arXiv.2108.07258
  3. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language Models are Few-Shot Learners (No. arXiv:2005.14165). arXiv Preprint. https://doi.org/10.48550/arXiv.2005.14165
  4. Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., & Zhang, Y. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4 (No. arXiv:2303.12712). arXiv Preprint. https://doi.org/10.48550/arXiv.2303.12712
  5. Chen, J., Chen, L., Huang, H., & Zhou, T. (2023). When do you need Chain-of-Thought Prompting for ChatGPT? (No. arXiv:2304.03262). arXiv Preprint. https://doi.org/10.48550/arXiv.2304.03262
  6. Chu, Z., Chen, J., Chen, Q., Yu, W., He, T., Wang, H., Peng, W., Liu, M., Qin, B., & Liu, T. (2024). Navigate through Enigmatic Labyrinth: A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, (pp. 1173-1203). ACL. https://aclanthology.org/2024.acl-long.65.pdf
  7. Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24(1), 87-185. https://doi.org/10.1017/s0140525x01003922
  8. Creswell, A., Shanahan, M., & Higgins, I. (2022). Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In The Eleventh International Conference on Learning Representations (ICLR 2023). https://openreview.net/forum?id=3Pf3Wg6o-A4
  9. Diao, S., Wang, P., Lin, Y., Pan, R., Liu, X., & Zhang, T. (2024). Active Prompting with Chain-of-Thought for Large Language Models (No. arXiv:2302.12246). arXiv Preprint. https://doi.org/10.48550/arXiv.2302.12246
  10. Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Liu, T., Chang, B., Sun, X., Li, L., & Sui, Z. (2024). A Survey on In-context Learning (No. arXiv:2301.00234). arXiv Preprint. https://doi.org/10.48550/arXiv.2301.00234
  11. Gao, A. (2023). Prompt Engineering for Large Language Models. SSRN. http://dx.doi.org/10.2139/ssrn.4504303
  12. Gong, D., Wan, X., & Wang, D. (2024). Working Memory Capacity of ChatGPT: An Empirical Study. Proceedings of the AAAI Conference on Artificial Intelligence, 38(9), Article 9. https://doi.org/10.1609/aaai.v38i9.28868
  13. Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), (pp. 22199-22213). Curran Associates. https://dl.acm.org/doi/10.5555/3600270.3601883
  14. Li, J., Li, G., Li, Y., & Jin, Z. (2025). Structured Chain-of-Thought prompting for code generation. ACM Transactions on Software Engineering and Methodology, 34(2), Article 37. https://doi.org/10.1145/3690635
  15. Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, (pp. 3214-3252). ACL. https://doi.org/10.18653/v1/2022.acl-long.229
  16. Liu, J., Shen, D., Zhang, Y., Dolan, B., Carin, L., & Chen, W. (2022). What Makes Good In-Context Examples for GPT-3?. In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, (pp. 100-114). ACL. https://doi.org/10.18653/v1/2022.deelio-1.10
  17. Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2022). Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, (pp. 8086-8098). ACL. https://aclanthology.org/2022.acl-long.556
  18. Mishra, S., Khashabi, D., Baral, C., Choi, Y., & Hajishirzi, H. (2021). Reframing instructional prompts to GPTk's language (No. arXiv:2109.07830). arXiv Preprint. http://arxiv.org/abs/2109.07830
  19. Patel, A., Bhattamishra, S., & Goyal, N. (2021). Are NLP models really able to solve simple math word problems?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (pp. 2080-2094). ACL. https://doi.org/10.18653/v1/2021.naacl-main.168
  20. Raiyan, S. R., Faiyaz, M. N., Kabir, S. Md. J., Kabir, M., Mahmud, H., & Hasan, M. K. (2023). Math Word Problem Solving by Generating Linguistic Variants of Problem Statements. In V. Padmakumar, G. Vallejo, & Y. Fu (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, (pp. 362-378). ACL. https://doi.org/10.18653/v1/2023.acl-srw.49
  21. Rodriguez, A. D. (2023). Prompts Matter: Insights and Strategies for Prompt Engineering in Automated Software Traceability. In 2023 IEEE 31st International Requirements Engineering Conference Workshops (REW), (pp. 455-464). IEEE. https://doi.org/10.1109/REW57809.2023.00087
  22. Shanahan, M., McDonell, K., & Reynolds, L. (2023). Role play with large language models. Nature, 623(7987), 493-498. https://doi.org/10.1038/s41586-023-06647-8
  23. Sun, Y., Yin, Z., Huang, X., Qiu, X., & Zhao, H. (2025). Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework (No. arXiv:2501.15581). arXiv Preprint. https://doi.org/10.48550/arXiv.2501.15581
  24. Wang, R., Zelikman, E., Poesia, G., Pu, Y., Haber, N., & Goodman, N. D. (2023). Hypothesis search: Inductive reasoning with language models (No. arXiv:2309.05660; Version 1). arXiv Preprint. http://arxiv.org/abs/2309.05660
  25. Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022a). Emergent abilities of large language models (No. arXiv:2206.07682). arXiv Preprint. https://doi.org/10.48550/arXiv.2206.07682
  26. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022b). Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS 2022), (pp. 24824-24837). Curran Associates. https://dl.acm.org/doi/10.5555/3600270.3602070
  27. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR 2023). https://doi.org/10.48550/arXiv.2210.03629
  28. Ye, X., & Durrett, G. (2022). The unreliability of explanations in few-shot prompting for textual reasoning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, (pp. 30378-30392). Curran Associates. https://dl.acm.org/doi/10.5555/3600270.3602472
  29. Zhang, M., Qian, T., Zhang, T., & Miao, X. (2023). Towards Model Robustness: Generating Contextual Counterfactuals for Entities in Relation Extraction. In Proceedings of the ACM Web Conference 2023, (pp. 1832-1842). ACM. https://doi.org/10.1145/3543507.3583504
  30. Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S., Li, J., Hu, R., Zhang, T., Wu, F., & Wang, G. (2024). Instruction tuning for large language models: A survey (No. arXiv:2308.10792). arXiv Preprint. https://doi.org/10.48550/arXiv.2308.10792
  31. Zhao, T. Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, (pp. 12697-12706). PMLR. https://proceedings.mlr.press/v139/zhao21c/zhao21c.pdf

This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.