Artificial Intelligence in Software Testing and Beyond: A Review of Current Practices and Emerging Challenges
Codrina-Victoria Lisaru and Claudiu-Vasile Kifor
Acta Informatica Pragensia, vol. preprint
Abstract:
Background: Artificial intelligence (AI) is increasingly used both to test software (T1) and to assure AI-based systems (T2), with adjacent software-engineering work that shapes testing practice (T3). Prior reviews are mostly descriptive and rarely report comparable maturity or replicability signals.
Objective: To provide a PRISMA-style systematic review (2015-2025, Web of Science) that maps T1-T2-T3 within a testing-centric frame, audits evidence maturity, threats reporting and artefact openness per paper, and adds an explicit lens on large language models and generative AI (LLMs/GenAI).
Methods: We queried the Web of Science Core Collection (2015-2025), screened via a predefined protocol, and extracted ten items (D1-D10) per study to normalize comparisons. Seventy-two papers met the criteria. Findings are organized into three themes: (T1) AI-based software testing, (T2) testing/validation of AI systems, and (T3) AI-related software engineering topics with implications for testing; T3 corresponds to the "beyond" in the paper's title.
Results: Practice-oriented evidence in the corpus is limited: 31 laboratory/simulation, 3 industrial, 10 hybrid, 6 conceptual/guideline and 22 secondary studies. Only 18/72 provide public artefacts; 33/72 report no empirical metrics. By theme, T1=32, T2=15, T3=25; the LLMs/GenAI subset totals 10 papers. Openness strongly co-occurs with measurable outcomes (88.9% of artefact-sharing papers report metrics vs. 42.6% of those without), yet "all-three credible" studies (industrial/hybrid evidence + open artefacts + metrics) are rare (4/72 overall; 1/10 for LLMs/GenAI).
Conclusion: AI shows promise for testing, but evidence remains thin on industrial adoption and reproducibility. We recommend prioritizing hybrid/industrial validations, releasing artefacts by default, and using standardized task-metric bundles. The review presents T1 and T2 results, separates T3 for scope clarity, and provides actionable maturity and replicability signals to guide responsible, empirical adoption.
Keywords: Software testing; Artificial intelligence; AI; AI-driven testing; Software engineering; Requirements engineering; Human-AI collaboration; Software quality; Large Language Models; LLMs
Downloads: http://aip.vse.cz/doi/10.18267/j.aip.303.html (text/html, free of charge)
Persistent link: https://EconPapers.repec.org/RePEc:prg:jnlaip:v:preprint:id:303
Ordering information: This journal article can be ordered from
Redakce Acta Informatica Pragensia, Katedra systémové analýzy, Vysoká škola ekonomická v Praze, nám. W. Churchilla 4, 130 67 Praha 3
http://aip.vse.cz
DOI: 10.18267/j.aip.303