Acta Informatica Pragensia X:X | DOI: 10.18267/j.aip.30066

SKR1: Benchmark for Testing Knowledge About Slovak Realia for Large Language Models

Marek Dobeš
Centre of Social and Psychological Sciences, Slovak Academy of Sciences, Košice, Slovak Republic

Background: To objectively evaluate the capabilities of large language models (LLMs), we need to develop tools that enable such assessment. While numerous benchmarks exist, the vast majority are in English and focus on general knowledge, often overlooking the cultural and factual specifics of smaller countries.

Objective: Currently, there is no benchmark that tests LLMs' knowledge of Slovak realia, and LLM performance in this domain remains inadequate. To measure and compare these capabilities objectively, our goal is to develop and validate a specialized benchmark for assessing LLMs' knowledge of the Slovak cultural and factual context.

Methods: We created a set of 35 questions on Slovak culture, geography, history and language, designed to have unambiguous answers suitable for automated evaluation. We then presented the questions to three major language models: DeepSeek V3, OpenAI GPT-4o and Llama 3.
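The automated evaluation described above can be sketched as a simple exact-match scorer: each model answer is normalized and compared against the gold answer, and the share of matches gives the percentage score. This is a minimal illustrative sketch, not the authors' actual code; the normalization rules, question set and function names are assumptions.

```python
def normalize(text: str) -> str:
    """Lowercase and strip punctuation/extra whitespace so short,
    unambiguous answers can be compared exactly."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def score(model_answers: list[str], gold_answers: list[str]) -> float:
    """Return the percentage of model answers matching the gold answers."""
    correct = sum(
        normalize(a) == normalize(g)
        for a, g in zip(model_answers, gold_answers)
    )
    return 100 * correct / len(gold_answers)

# Toy example with two hypothetical Slovak-realia answers:
gold = ["Gerlachovský štít", "Bratislava"]
answers = ["gerlachovsky stit", "Bratislava"]  # missing diacritics -> mismatch
print(score(answers, gold))  # 50.0
```

Note that this naive comparison treats missing diacritics as an error; a production scorer would likely also apply Unicode accent folding before comparing.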

Results: DeepSeek answered 54% of the questions correctly, OpenAI GPT 51% and Llama 40%. The models scored best on geography questions. The overall scores show that the models are not very good at recognising Slovak realia.

Conclusion: We present a benchmark for evaluating large language models on Slovak-related knowledge. Even the most advanced current models, including OpenAI GPT and DeepSeek, answered only around half of the questions correctly. This highlights a significant gap in international LLMs' understanding of culturally specific facts and underscores the need for specialized, nationally tailored language models.

Keywords: LLM; Benchmark; Slovak realia; DeepSeek; OpenAI GPT; Llama.

Received: August 5, 2025; Revised: December 6, 2025; Accepted: January 3, 2026; Prepublished online: February 18, 2026 



This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.