AI-supported test case generation
Be Germany’s AI working group continuously evaluates practical applications of artificial intelligence in the finance industry. This insight is a snapshot of our latest findings on AI-assisted software testing. The objective of this study is to assess how effectively various open-source AI models that can be run locally generate test cases from a functional specification, and how much they reduce the effort required for test case creation. A standardized, identical prompt was provided to every model under consideration, and the results were analysed and compared for quality, completeness, logical flow, organization of output, and compliance with ISTQB standards. The findings provide insights into the advantages and limitations of each AI model in the context of test case generation and indicate their potential for automating testing processes.
Introduction
One of the most challenging and labour-intensive stages of the software testing lifecycle is the test case derivation phase. Traditionally, this process requires highly skilled quality assurance engineers to carefully review the specifications and build detailed test scenarios that verify and validate system behaviour under diverse operating conditions. As the system grows, so does the complexity of the software to be tested, and manually creating test cases becomes more resource-intensive and prone to human error.
To address these challenges, the software industry is increasingly turning to the mega-trend of artificial intelligence (AI). Our study evaluates the performance of multiple AI models in generating test cases using a specification document for a small segment of a trading application.
What is “AI” in This Publication?
In this publication, “AI” refers specifically to large language models (LLMs), a type of artificial intelligence designed to process and generate human-like text. LLMs are advanced machine learning models trained on vast amounts of text data to predict and generate coherent responses based on input prompts. They use deep neural networks, particularly transformer architectures, to understand context, relationships, and patterns in language.
We focus on open-source LLMs that can be run locally, ensuring greater control, privacy, and customizability compared to cloud-based alternatives. These models are capable of tasks such as code generation, debugging, and test case creation, making them powerful tools for software testing. By leveraging their ability to understand natural language and programming syntax, LLMs enable efficient automation and augmentation of testing processes.
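To illustrate how such a locally run model can be used in practice, the following minimal sketch queries a local LLM server for test cases derived from a specification excerpt. It assumes an Ollama instance running on its default port; the model name and prompt are illustrative placeholders, not the exact setup used in this study.

```python
# Minimal sketch: asking a locally hosted open-source LLM to derive test cases.
# Assumes an Ollama server on its default port; model name and prompt are
# illustrative placeholders.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def generate_test_cases(model: str, specification: str, prompt_template: str) -> str:
    """Send the specification plus instructions to a local LLM and return its raw text output."""
    payload = {
        "model": model,
        "prompt": prompt_template.format(spec=specification),
        "stream": False,                  # return the complete answer in one response
        "options": {"temperature": 0.2},  # low temperature for more reproducible output
    }
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```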
Research Methodology
In the preliminary stage of the study, an extensive analysis of several locally executable open-source AI models was carried out. During the assessment process, factors such as model performance, efficiency, compatibility with the available infrastructure, ease of use, and ethical aspects were considered.
Based on this comprehensive analysis, four AI models were shortlisted for the final round of study. The selection process ensured that the models would offer thorough insights while matching the specific requirements of the study.
The research approach was designed to guarantee an objective and fair examination of the AI models’ capability to generate test cases. Each model under consideration was given the same prompt and the same functional specification. This mitigated inconsistencies in the input conditions that could otherwise have compromised the output.
- Input Preparation: Creation of a prompt and a standardized functional specification that describes the order entry window of a trading application and the conditions required for valid order entry.
- AI Model Processing: Every model receives the same input independently, with fixed temperature and other text generation settings (see the sketch after this list).
- Output Collection: Systematic collection of the outputs generated by every model, without any modification or post-processing.
- Evaluation: Review and assessment of outputs based on the criteria listed below:
- ISTQB Compliance: Adherence to software testing standards
- Quality: Clarity, precision, and level of detail
- Logical Flow: Logical and well-structured test steps
- Completeness: Coverage of positive, negative and edge case test scenarios
- Organization: Organised presentation of test cases
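The sketch below illustrates this workflow under the same assumptions as the earlier snippet: one fixed prompt template, the same specification file for every model, identical generation settings, and verbatim storage of each model’s output for later evaluation. The model names, file paths, and the imported generate_test_cases helper are hypothetical placeholders.

```python
# Sketch of the study workflow: identical input for every shortlisted model,
# fixed generation settings, outputs collected without post-processing.
# Model names, paths, and the imported helper are hypothetical.
from pathlib import Path

from llm_client import generate_test_cases  # hypothetical module wrapping the earlier helper

MODELS = ["model-1", "model-2", "model-3", "model-4"]  # placeholders for the shortlisted models

PROMPT_TEMPLATE = (
    "You are a software test analyst. Derive ISTQB-style test cases "
    "(ID, title, preconditions, steps, expected result) from the following "
    "functional specification:\n\n{spec}"
)

def run_study(spec_path: str, output_dir: str = "outputs") -> None:
    """Run the same prompt and specification through every model and store the raw outputs."""
    specification = Path(spec_path).read_text(encoding="utf-8")
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    for model in MODELS:
        raw_output = generate_test_cases(model, specification, PROMPT_TEMPLATE)
        # Store exactly what the model produced, with no modification or filtering
        (out / f"{model}.md").write_text(raw_output, encoding="utf-8")

if __name__ == "__main__":
    run_study("order_entry_specification.txt")  # illustrative file name
```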
The review process incorporated both quantitative and qualitative assessments. The quantitative metrics included the number of valid, repeated, invalid, and new test cases. The qualitative review was carried out by experienced software testing professionals proficient in ISTQB standards. This dual approach provided a thorough understanding of the strengths and weaknesses of test case generation from a functional specification.
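As a minimal sketch of the quantitative side of this evaluation, the snippet below computes coverage and noise percentages from manually classified counts of valid, repeated, invalid, and new test cases. The class, function names, and numbers are illustrative and do not reflect the study’s actual data.

```python
# Minimal sketch of the quantitative metrics, assuming the generated test cases
# have already been classified manually. All numbers below are illustrative.
from dataclasses import dataclass

@dataclass
class ModelResult:
    valid: int      # test cases matching the manually derived benchmark
    repeated: int   # duplicates of other generated test cases
    invalid: int    # test cases contradicting the specification
    new: int        # plausible scenarios not present in the manual benchmark

def coverage(result: ModelResult, benchmark_size: int) -> float:
    """Share of the manually derived test cases reproduced by the model, in percent."""
    return 100.0 * result.valid / benchmark_size

def noise(result: ModelResult) -> float:
    """Share of generated test cases that are repeated or invalid, in percent."""
    total = result.valid + result.repeated + result.invalid + result.new
    return 100.0 * (result.repeated + result.invalid) / total if total else 0.0

# Illustrative example against a benchmark of 12 manually created test cases
example = ModelResult(valid=9, repeated=1, invalid=1, new=2)
print(f"coverage: {coverage(example, 12):.1f}%, noise: {noise(example):.1f}%")
```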
Practical Applications
- Streamlining QA workflows in software development
- Reducing time-to-market for software products
- Improving test coverage through AI assistance
- Optimizing resource allocation in testing departments
Comparative Analysis and Numerical Results
The comparative analysis shows significant differences in the test case generation competencies of the AI models. A detailed quantitative assessment of the results across the evaluation factors highlights each model’s strengths and weaknesses.

From the above table, it is observed that:
Model 1 demonstrates high output generation capability with 80% test coverage and no repeated or invalid test cases, missing only 20% of the scenarios compared with the work performed by the QA team. The model did not introduce new test scenarios, prioritizing test coverage over generating creative cases. It also showed outstanding capability in structuring test cases with a clear and logical flow.
Model 2 offers a balanced performance with a respectable 66.67% coverage and a noteworthy degree of creativity. However, 13.33% test repetition suggests that there is scope for improvement in output filtering. While the model matches the manually set benchmark only moderately, the delta suggests it is well suited to complementary testing.
Model 3 shows low output volume and weak validity, with more than 50% of the scenarios missed. No new scenarios were introduced, suggesting limited diversity in test case generation. As a result, this model is only suitable for lightweight testing tasks where reduced coverage is acceptable, rather than as a stand-alone option.
Model 4 is notable for its creativity but struggles with dependability. Although it generated 66.67% of the test cases produced manually, only 40% were valid. The repeated and invalid test cases introduce noise into the testing process. The 26.67% share of new test cases shows that the model can be creative, expanding beyond the defined scope.

Conclusion
Our comprehensive analysis of test cases generated by AI models from a functional specification provides key insights into the strengths and weaknesses of major AI models available in today’s market. The key takeaway is that, although AI-powered test generation has great potential, its effectiveness varies significantly across models and implementation approaches.
- Model 1 exhibits superior capability to develop structured, ISTQB-compliant test cases with a negligible degree of invalidity or repetition.
- Integrating multiple AI models can deliver more comprehensive test coverage than any single model working alone.
- Human intervention remains necessary when testing critical applications because of limitations in the logical flow and organisation capabilities of an AI model.
- Effective prompts have a significant impact on the quality of the output and its adherence to testing standards.
- AI-assisted test case generation can potentially enhance coverage and reduce manual effort.