Acta Informatica Pragensia X:X | DOI: 10.18267/j.aip.303189
Artificial Intelligence in Software Testing and Beyond: A Review of Current Practices and Emerging Challenges
- Department of Industrial Engineering and Management, Faculty of Engineering, Lucian Blaga University of Sibiu, Sibiu, Romania
Background: Artificial intelligence (AI) is increasingly used both to test software (T1) and to assure AI-based systems (T2), with adjacent software-engineering work that shapes testing practice (T3). Prior reviews are mostly descriptive and rarely report comparable maturity or replicability signals.
Objective: To provide a PRISMA-style systematic review (2015–2025, Web of Science) that maps T1–T2–T3 within a testing-centric frame, audits each paper for evidence maturity, threats-to-validity reporting, and artefact openness, and adds an explicit lens on large language models and generative AI (LLMs/GenAI).
Methods: We queried the Web of Science Core Collection (2015–2025), screened records against a predefined protocol, and extracted ten items (D1–D10) per study to normalize comparisons. Seventy-two papers met the criteria. Findings are organized into three themes: (T1) AI-based software testing, (T2) testing/validation of AI systems, and (T3) AI-related software engineering topics with implications for testing. T3 corresponds to the “beyond” in the paper’s title.
Results: The corpus is limited in practice-oriented evidence: 31 laboratory/simulation, 3 industrial, 10 hybrid, 6 conceptual/guideline and 22 secondary studies. Only 18/72 provide public artefacts; 33/72 report no empirical metrics. By theme, T1=32, T2=15, T3=25; the LLMs/GenAI subset totals 10 papers. Openness strongly co-occurs with measurable outcomes (88.9% of artefact-sharing papers report metrics vs 42.6% of papers without public artefacts), yet “all-three credible” studies (industrial/hybrid + open artefacts + metrics) are rare (4/72 overall; 1/10 for LLMs/GenAI).
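The reported figures can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not part of the review's method: it uses only the counts stated in the Results, and the helper `infer_count` is a hypothetical name introduced here. The two per-group counts it derives are back-calculated from the reported percentages and are an assumption, since the abstract does not state them directly.

```python
# Counts as reported in the Results paragraph.
TOTAL = 72
STUDY_TYPES = {
    "laboratory/simulation": 31,
    "industrial": 3,
    "hybrid": 10,
    "conceptual/guideline": 6,
    "secondary": 22,
}
THEMES = {"T1": 32, "T2": 15, "T3": 25}
WITH_ARTEFACTS = 18   # papers providing public artefacts
NO_METRICS = 33       # papers reporting no empirical metrics


def infer_count(percent: float, denominator: int) -> int:
    """Return the unique integer k whose share k/denominator rounds to `percent` (one decimal)."""
    target = round(percent * 10)  # compare in tenths of a percent
    matches = [
        k for k in range(denominator + 1)
        if round(1000 * k / denominator) == target
    ]
    if len(matches) != 1:
        raise ValueError(f"{percent}% of {denominator} is ambiguous: {matches}")
    return matches[0]


# ASSUMPTION: these counts are inferred from the percentages, not stated in the abstract.
metrics_with_artefacts = infer_count(88.9, WITH_ARTEFACTS)
metrics_without_artefacts = infer_count(42.6, TOTAL - WITH_ARTEFACTS)

# Study types and themes should each partition the full corpus.
assert sum(STUDY_TYPES.values()) == TOTAL
assert sum(THEMES.values()) == TOTAL
# Papers with metrics plus papers without metrics should equal the corpus size.
assert metrics_with_artefacts + metrics_without_artefacts == TOTAL - NO_METRICS

print(metrics_with_artefacts, metrics_without_artefacts)
```

Under these assumptions the percentages correspond to 16 of 18 artefact-sharing papers and 23 of the 54 remaining papers, that is 39 papers with metrics, which matches the 33 of 72 reported as having none.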
Conclusion: AI shows promise for testing, but evidence remains thin on industrial adoption and reproducibility. We recommend prioritizing hybrid/industrial validations, releasing artefacts by default, and using standardized task–metric bundles. The review presents T1 and T2 results, separates T3 for scope clarity, and provides actionable maturity and replicability signals to guide responsible, empirical adoption.
Keywords: Software testing; Artificial intelligence; AI; AI-driven testing; Software engineering; Requirements engineering; Human-AI collaboration; Software quality; Large Language Models; LLMs.
Received: August 29, 2025; Revised: January 8, 2026; Accepted: January 19, 2026; Prepublished online: April 9, 2026
This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.
