By Chris Sherlock, Head of Test Capability at Nimble Approach
As AI becomes more central to modern technology, it’s crucial to ensure these systems operate safely, correctly, and fairly. While traditional software testing principles still apply, AI introduces additional complexity that demands new approaches.
This blog provides a high-level overview of how we at Nimble Approach handle testing and quality assurance (QA) for AI-enabled software systems – from simple data retrieval and summarisation tools to more advanced systems that generate documentation to support rapid policy and process development.
The Core Challenge: Why AI is Different
In traditional software development, programmers write explicit rules (for example, “If the user clicks ‘Submit,’ send the data”), and testing verifies that those rules behave as expected.
AI-based systems also follow rules, but their outputs are probabilistic rather than deterministic. The same input may not always produce the same output, which makes verification more complex – especially given the wide range of AI architectures now in use.
Each AI system can work very differently. For example, looking at Large Language Model (LLM) based systems, we could have:
- Model-Only Generation: Here, we’re using a model (for example, GPT-5.2) that relies purely on its training data to provide answers.
- Retrieval-Augmented Generation (RAG): Enhances a language model’s knowledge by retrieving information from external – usually unstructured – sources before generating a response.
- Cache-Augmented Generation (CAG): Improves response speed and consistency by storing and reusing previous high-quality model outputs for similar queries.
- Knowledge-Augmented Generation (KAG): Guides a model’s generation using structured knowledge graphs or databases to ensure factual accuracy and grounded responses.
Each of these architectures requires a slightly different testing focus, but all are underpinned by ensuring the fundamentals of the model and its data are of sufficient quality.
Testing the Model Fundamentals
While the system architecture (RAG, KAG, etc.) adds unique layers of complexity, every AI system is underpinned by the core quality of the model itself and the data it was trained on. These checks serve as the universal quality gates for any model we deploy.
Data Quality & Integrity
This is the absolute core of AI quality: Garbage In, Garbage Out. Before any model training even begins, we must ensure the quality of the dataset. This involves checking for cleanliness, representation, and bias (a simple sketch of these checks follows the list below):
- Cleanliness: Are there errors, missing values, or noise in the data that could confuse the model?
- Representation: Does the training data accurately reflect the real-world scenarios the system will face? A model trained only on data from one country, for example, is unlikely to perform well for another.
- Bias: We must critically analyse the data for systemic biases that could lead to unfair or discriminatory outcomes.
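To make this concrete, here is a minimal sketch of the kind of pre-training checks we might automate. The dataset, column names, and protected attribute below are assumptions made up for illustration, not a prescribed standard.

```python
import pandas as pd

# Illustrative training data - the columns, values, and attributes below are
# assumptions for this sketch, not a prescribed standard
df = pd.DataFrame({
    "country": ["UK", "UK", "UK", "FR", "DE", "UK", "FR", "DE"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "approved": [1, 0, 1, 1, 0, 1, 1, 0],
    "income": [32000, 41000, None, 28000, 55000, 39000, 47000, 30000],
})

# Cleanliness: how much of each column is missing?
print("Missing-value share per column:\n", df.isna().mean())

# Representation: does one country dominate the training set?
print("Country share:\n", df["country"].value_counts(normalize=True))

# Bias (simplistic proxy): do outcomes differ sharply across a protected attribute?
print("Approval rate by gender:\n", df.groupby("gender")["approved"].mean())
```

In practice these checks would run against the full dataset in a pipeline, with agreed thresholds rather than manual inspection of the output.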
Performance and Accuracy
Once we’re happy with the data, we measure how well the model actually performs its primary job. Unlike traditional software, where the result is simply right or wrong, AI performance is measured on a sliding scale. We can look at standard metrics such as precision and recall, but the most crucial metric is business-critical accuracy: does the model perform well enough to satisfy the commercial need and deliver value safely? This must be clearly defined and validated against real-world expectations.
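As a simple illustration, a check like the sketch below could sit in a test suite alongside standard metrics. The labels, predictions, and 90% target are assumed purely for the example – in reality the threshold is agreed with stakeholders.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Assumed ground-truth labels and model predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # of the positives we predicted, how many were right?
recall = recall_score(y_true, y_pred)        # of the real positives, how many did we find?
accuracy = accuracy_score(y_true, y_pred)

BUSINESS_ACCURACY_TARGET = 0.90  # assumed threshold agreed with the business

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
if accuracy < BUSINESS_ACCURACY_TARGET:
    print("Model does not yet meet the agreed business-critical accuracy target")
```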
Robustness and Security
Intelligent systems need to be robust. This involves testing how the model handles unexpected, unusual, or outright malicious inputs – a simple stability probe is sketched after this list.
- Edge Cases: How does the model react to data that is slightly outside its normal range?
- Adversarial Attacks: These are intentional, subtle modifications to the input data designed to trick the model into producing an incorrect output. Testing for this ensures the system is secure and reliable, especially in sensitive applications.
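One very simple robustness probe, sketched below, perturbs inputs slightly and measures how often the model’s prediction changes. The model, data, and noise level are stand-ins for whatever is actually deployed, and dedicated adversarial-testing tooling would go much further than this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Stand-in model and data - in practice this would be the deployed model and real inputs
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Edge-case / adversarial-style check: tiny perturbations should rarely flip a prediction
noise = rng.normal(scale=0.05, size=X.shape)  # assumed perturbation budget
flip_rate = np.mean(model.predict(X) != model.predict(X + noise))
print(f"Prediction flip rate under small perturbations: {flip_rate:.1%}")
# The acceptable flip rate is a judgement call agreed per use case, not a fixed rule
```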
Specialised Testing for Augmented Systems
While the checks on the model fundamentals are universal, systems that augment their LLM with external knowledge require focused, specialised testing.
Testing RAG Systems
RAG systems are heavily reliant on finding the correct information to ground the response. Testing must, therefore, focus on the retrieval process and the model’s use of the retrieved context. Without sufficient testing, RAG systems risk delivering responses that appear confident and factual but are contextually incorrect, potentially breaching policy or regulatory requirements.
- Relevance: Is the system finding the correct documents from the knowledge base for a given prompt? If the retrieved documents are irrelevant, the model cannot generate an accurate answer. We could even hit context drift, whereby outdated documents available to the system are presented as truth, causing compliance or business errors.
- Hallucinations: Is the final answer accurately grounded in the retrieved context? This is a crucial check to ensure the model is not ‘hallucinating’ – that is, making up facts or straying from the provided source material. If the system appears confident, but provides factually incorrect or fabricated material, it creates liability and can undermine user trust.
For example, in a banking chatbot, a query about disputing a credit card charge must retrieve the correct internal dispute policy – not unrelated content such as mortgage documentation. Testing ensures both correct retrieval and that the final response is grounded solely in the retrieved material.
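That banking example can be turned into an automated check along the lines of the sketch below. The retriever and generator here are hard-coded stand-ins for the real pipeline, and the document IDs and policy facts are assumptions made up for illustration – they are not any particular bank’s content or a specific framework’s API.

```python
# Stand-ins for the real pipeline: in practice these would call the deployed
# retriever and LLM. Document ids, phrases, and policy text are assumptions.
def retrieve(query: str) -> list[dict]:
    return [{"id": "card-dispute-policy-v3",
             "text": "Customers may dispute a card charge within 120 days using the dispute form."}]

def generate(query: str, documents: list[dict]) -> str:
    return "You can dispute the charge within 120 days by completing the dispute form."

def test_dispute_query_is_relevant_and_grounded():
    query = "How do I dispute a credit card charge?"
    docs = retrieve(query)

    # Relevance: the correct policy is retrieved, unrelated material is not
    doc_ids = {d["id"] for d in docs}
    assert "card-dispute-policy-v3" in doc_ids
    assert "mortgage-terms-2021" not in doc_ids

    # Groundedness: key facts in the answer must also appear in the retrieved sources
    answer = generate(query, docs).lower()
    source_text = " ".join(d["text"] for d in docs).lower()
    for fact in ["120 days", "dispute form"]:
        assert fact in answer and fact in source_text

test_dispute_query_is_relevant_and_grounded()
print("RAG relevance and grounding checks passed")
```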
Testing CAG Systems
CAG systems aim to improve response speed and consistency by reusing previous high-quality outputs. Testing must ensure that the cached outputs are still appropriate and that the caching mechanism itself is working as expected.
- Quality and Consistency: Is the cached response always accurate and high quality? We check for output consistency and verify that the cached answer remains valid over time. Without these checks, the system may consistently serve responses that were only temporarily correct, or that contain subtly incorrect information which can spread rapidly and misinform users.
- Cache Invalidation: When does the system invalidate the cache to get a fresh model response? A well-tested invalidation policy ensures that users receive up-to-date information when the underlying knowledge or model capability has changed. The main risk here is cache staleness, whereby cached responses remain active even after the underlying data has changed, so the system provides outdated and incorrect answers.
In practice, testing CAG systems means validating both the quality of cached responses and the conditions under which they are refreshed. Consider an internal HR system that uses CAG to answer common employee questions, such as “What is the annual leave allowance?” Testing verifies that the cached response remains correct over time and is invalidated only when the underlying HR policy is updated, prompting a new authoritative answer to be generated and re-cached.
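A minimal sketch of the caching behaviour we would test is shown below. The cache keying, time-to-live, and policy-version tagging are illustrative assumptions rather than any specific product’s design.

```python
import time

# Illustrative cache keyed on the normalised query plus the version of the
# underlying policy document - both keying choices are assumptions for this sketch
class AnswerCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}  # (query, policy_version) -> (answer, stored_at)

    def get(self, query: str, policy_version: str):
        key = (query.strip().lower(), policy_version)
        entry = self.entries.get(key)
        if entry is None:
            return None
        answer, stored_at = entry
        if time.time() - stored_at > self.ttl:  # stale entry: force a fresh generation
            del self.entries[key]
            return None
        return answer

    def put(self, query: str, policy_version: str, answer: str):
        self.entries[(query.strip().lower(), policy_version)] = (answer, time.time())

# Test: a policy update must invalidate the cached answer
cache = AnswerCache(ttl_seconds=3600)
cache.put("What is the annual leave allowance?", "policy-v1", "25 days per year")
assert cache.get("What is the annual leave allowance?", "policy-v1") == "25 days per year"
assert cache.get("What is the annual leave allowance?", "policy-v2") is None  # new policy, no stale reuse
```

In a design like this, the policy version would come from the document store’s metadata, so updating the HR policy automatically changes the cache key and forces a fresh, authoritative answer to be generated and re-cached.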
Testing KAG Systems
KAG systems integrate structured knowledge graphs or databases. The focus shifts to verifying the accuracy of the factual data and the system’s ability to interpret that structure. If we don’t test well, we risk the system delivering factually precise but contextually flawed information, which can lead to compliance or business errors.
- Factual Accuracy: Is the model correctly interpreting and using the structured data from the knowledge graph? A key quality gate is confirming the factual integrity of the resulting response. We need to ensure that the model isn’t misinterpreting complex or ambiguous data, introducing factual errors into summaries that then compromise business decisions.
- Knowledge Maintenance: How is the knowledge graph maintained and updated? Since structured data changes over time, we must have robust processes to ensure the model is always accessing the most current and accurate information. If we don’t actively manage our data, we risk data drift, where the system retrieves information that is technically relevant but factually incorrect, creating a false sense of confidence in responses that are fundamentally wrong.
For example, in a pharmaceutical research application, testing verifies that the model correctly retrieves precise numerical data from structured knowledge graphs (such as dosage levels or trial outcomes) and presents it accurately in narrative summaries.
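A crude but useful automated check, sketched below, extracts the numbers from a generated summary and compares them against the structured source. The knowledge-graph record and summary text are assumptions made up for illustration.

```python
import re

# Assumed structured record from the knowledge graph (illustrative values only)
trial_record = {"drug": "ExampleDrug", "dose_mg": 50, "response_rate_pct": 62.5}

# Assumed model-generated narrative summary of that record
summary = "At a 50 mg dose, ExampleDrug showed a response rate of 62.5% in the trial."

# Factual accuracy: every numeric fact in the record must appear in the summary,
# and the summary must not introduce numbers that are not in the record
numbers_in_summary = {float(n) for n in re.findall(r"\d+(?:\.\d+)?", summary)}
numbers_in_record = {float(v) for v in trial_record.values() if isinstance(v, (int, float))}

assert numbers_in_record <= numbers_in_summary, "Summary is missing facts from the knowledge graph"
assert numbers_in_summary <= numbers_in_record, "Summary contains numbers not grounded in the record"
print("Numeric facts in the summary match the structured source")
```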
Operational QA and Monitoring
Testing an AI system is not a one-off event; it’s a continuous process that extends beyond the initial deployment. Since real-world data and user behaviour constantly change, the QA focus shifts to ongoing validation and maintenance.
- Monitoring in Production: This involves setting up robust systems to track the model’s performance in a live environment. We continuously monitor for key indicators such as a sudden drop in model accuracy, data drift (where new live data starts to diverge significantly from the training data), and the potential emergence of bias in production outputs. Early detection of these changes is crucial for preventing system failure – a minimal drift-check sketch follows this list.
- Feedback Loops: A vital part of operational QA is establishing clear channels for user feedback. We need a system for gathering, analysing, and prioritising this feedback, then using it systematically to inform model retraining and updates to the knowledge bases. This ensures the AI system constantly learns and remains relevant.
- The QA Team’s Evolving Role: The role of the QA team becomes less about running fixed test scripts and more about continuous monitoring, validation, and system maintenance. It transforms into an MLOps (Machine Learning Operations) focus, where quality assurance is integrated into the entire lifecycle, ensuring that the system is not only correct on day one but remains reliable and safe indefinitely.
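As one concrete example of production monitoring, the sketch below uses a two-sample Kolmogorov-Smirnov test to flag when a live feature’s distribution drifts away from the training data. The feature values and alert threshold are assumptions for illustration; real monitoring would cover many features and richer signals.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Assumed reference (training-time) and live distributions for one input feature
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=1000)  # live data has shifted

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the live data
# no longer looks like the data the model was trained on
statistic, p_value = ks_2samp(training_feature, live_feature)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")

ALERT_THRESHOLD = 0.01  # assumed significance level for raising a drift alert
if p_value < ALERT_THRESHOLD:
    print("Data drift detected - trigger investigation and possible retraining")
```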
Conclusion
Effective testing of AI systems requires a combination of established software quality assurance principles and a more specialised, continuous approach. Ultimately, quality is achieved through robust checks across three critical areas: the source data, the core model, and the system’s augmentation architecture (such as RAG, KAG, and similar approaches).
As AI evolves at pace, our approach to quality must evolve with it. At Nimble Approach, we embed this continuous focus on quality throughout the entire lifecycle, ensuring the AI systems we deliver are not only innovative, but consistently reliable, safe, and fair.
Looking for support in assuring the quality of your AI systems? Reach out to our team today.