By Chris Sherlock, Head of Test Capability at Nimble Approach
As AI becomes more central to modern technology, it’s crucial to ensure these systems operate safely, correctly, and fairly. While traditional software testing principles still apply, AI introduces additional complexity that demands new approaches.
This blog provides a high-level overview of how we at Nimble Approach handle testing and quality assurance (QA) for AI-enabled software systems – from simple data retrieval and summarisation tools to more advanced systems that generate documentation to support rapid policy and process development.
The Core Challenge: Why AI is Different
In traditional software development, programmers write explicit rules (for example, “If the user clicks ‘Submit,’ send the data”), and testing verifies that those rules behave as expected.
AI-based systems also follow rules, but their outputs are probabilistic rather than deterministic. The same input may not always produce the same output, which makes verification more complex – especially given the wide range of AI architectures now in use.
Each AI system can work very differently. For example, looking at Large Language Model (LLM) based systems, we could have:
- Model-Only Generation: Here, we’re using a model (for example, GPT-5.2) that relies purely on its training data to provide answers.
- Retrieval-Augmented Generation (RAG): Enhances a language model’s knowledge by retrieving information from external – usually unstructured – sources before generating a response.
- Cache-Augmented Generation (CAG): Improves response speed and consistency by storing and reusing previous high-quality model outputs for similar queries.
- Knowledge-Augmented Generation (KAG): Guides a model’s generation using structured knowledge graphs or databases to ensure factual accuracy and grounded responses.
Each of these architectures requires a slightly different testing focus, but all are underpinned by ensuring the fundamentals of the model and its data are of sufficient quality.
Testing the Model Fundamentals
While the system architecture (RAG, KAG, etc.) adds unique layers of complexity, every AI system is underpinned by the core quality of the model itself and the data it was trained on. These checks serve as the universal quality gates for any model we deploy.
Data Quality & Integrity
This is the absolute core of AI quality: Garbage In, Garbage Out. Before any model training even begins, we must ensure the quality of the dataset. This involves checking for cleanliness, representation, and bias (a simple sketch of these checks follows the list below):
- Cleanliness: Are there errors, missing values, or noise in the data that could confuse the model?
- Representation: Does the training data accurately reflect the real-world scenarios the system will face? A model trained only on data from one country, for example, is unlikely to perform well for another.
- Bias: We must critically analyse the data for systemic biases that could lead to unfair or discriminatory outcomes.
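To make this concrete, here is a minimal sketch of the kind of pre-training checks we might automate. The dataset, column names, and protected attribute below are assumptions made up for illustration, not a prescribed standard.

```python
import pandas as pd

# Illustrative training data - the columns, values, and attributes below are
# assumptions for this sketch, not a prescribed standard
df = pd.DataFrame({
    "country": ["UK", "UK", "UK", "FR", "DE", "UK", "FR", "DE"],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "approved": [1, 0, 1, 1, 0, 1, 1, 0],
    "income": [32000, 41000, None, 28000, 55000, 39000, 47000, 30000],
})

# Cleanliness: how much of each column is missing?
print("Missing-value share per column:\n", df.isna().mean())

# Representation: does one country dominate the training set?
print("Country share:\n", df["country"].value_counts(normalize=True))

# Bias (simplistic proxy): do outcomes differ sharply across a protected attribute?
print("Approval rate by gender:\n", df.groupby("gender")["approved"].mean())
```

In practice these checks would run against the full dataset in a pipeline, with agreed thresholds rather than manual inspection of the output.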
Performance and Accuracy
Once we’re happy with the data, we measure how well the model actually performs its primary job. Unlike traditional software, where the result is simply right or wrong, AI performance is measured on a sliding scale. We can look at standard metrics such as precision and recall, but the most crucial metric is business-critical accuracy: does the model perform well enough to satisfy the commercial need and deliver value safely? This must be clearly defined and validated against real-world expectations.
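As a simple illustration, a check like the sketch below could sit in a test suite alongside standard metrics. The labels, predictions, and 90% target are assumed purely for the example – in reality the threshold is agreed with stakeholders.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Assumed ground-truth labels and model predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # of the positives we predicted, how many were right?
recall = recall_score(y_true, y_pred)        # of the real positives, how many did we find?
accuracy = accuracy_score(y_true, y_pred)

BUSINESS_ACCURACY_TARGET = 0.90  # assumed threshold agreed with the business

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
if accuracy < BUSINESS_ACCURACY_TARGET:
    print("Model does not yet meet the agreed business-critical accuracy target")
```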
Robustness and Security
Intelligent systems need to be robust. This involves testing how the model handles unexpected, unusual, or outright malicious inputs – a simple stability probe is sketched after this list.
- Edge Cases: How does the model react to data that is slightly outside its normal range?
- Adversarial Attacks: These are intentional, subtle modifications to the input data designed to trick the model into producing an incorrect output. Testing for this ensures the system is secure and reliable, especially in sensitive applications.
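One very simple robustness probe, sketched below, perturbs inputs slightly and measures how often the model’s prediction changes. The model, data, and noise level are stand-ins for whatever is actually deployed, and dedicated adversarial-testing tooling would go much further than this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Stand-in model and data - in practice this would be the deployed model and real inputs
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# Edge-case / adversarial-style check: tiny perturbations should rarely flip a prediction
noise = rng.normal(scale=0.05, size=X.shape)  # assumed perturbation budget
flip_rate = np.mean(model.predict(X) != model.predict(X + noise))
print(f"Prediction flip rate under small perturbations: {flip_rate:.1%}")
# The acceptable flip rate is a judgement call agreed per use case, not a fixed rule
```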
Specialised Testing for Augmented Systems
While the checks on the model fundamentals are universal, systems that augment their LLM with external knowledge require focused, specialised testing.
Testing RAG Systems
RAG systems are heavily reliant on finding the correct information to ground the response. Testing must, therefore, focus on the retrieval process and the model’s use of the retrieved context. Without sufficient testing, RAG systems risk delivering responses that appear confident and factual but are contextually incorrect, potentially breaching policy or regulatory requirements.
- Relevance: Is the system finding the correct documents from the knowledge base for a given prompt? If the retrieved documents are irrelevant, the model cannot generate an accurate answer. We could even hit context drift, whereby outdated documents available to the system are presented as truth, causing compliance or business errors.
- Hallucinations: Is the final answer accurately grounded in the retrieved context? This is a crucial check to ensure the model is not ‘hallucinating’ – that is, making up facts or straying from the provided source material. If the system appears confident, but provides factually incorrect or fabricated material, it creates liability and can undermine user trust.
For example, in a banking chatbot, a query about disputing a credit card charge must retrieve the correct internal dispute policy – not unrelated content such as mortgage documentation. Testing ensures both correct retrieval and that the final response is grounded solely in the retrieved material.
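That banking example can be turned into an automated check along the lines of the sketch below. The retriever and generator here are hard-coded stand-ins for the real pipeline, and the document IDs and policy facts are assumptions made up for illustration – they are not any particular bank’s content or a specific framework’s API.

```python
# Stand-ins for the real pipeline: in practice these would call the deployed
# retriever and LLM. Document ids, phrases, and policy text are assumptions.
def retrieve(query: str) -> list[dict]:
    return [{"id": "card-dispute-policy-v3",
             "text": "Customers may dispute a card charge within 120 days using the dispute form."}]

def generate(query: str, documents: list[dict]) -> str:
    return "You can dispute the charge within 120 days by completing the dispute form."

def test_dispute_query_is_relevant_and_grounded():
    query = "How do I dispute a credit card charge?"
    docs = retrieve(query)

    # Relevance: the correct policy is retrieved, unrelated material is not
    doc_ids = {d["id"] for d in docs}
    assert "card-dispute-policy-v3" in doc_ids
    assert "mortgage-terms-2021" not in doc_ids

    # Groundedness: key facts in the answer must also appear in the retrieved sources
    answer = generate(query, docs).lower()
    source_text = " ".join(d["text"] for d in docs).lower()
    for fact in ["120 days", "dispute form"]:
        assert fact in answer and fact in source_text

test_dispute_query_is_relevant_and_grounded()
print("RAG relevance and grounding checks passed")
```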
Testing CAG Systems
CAG systems aim to improve response speed and consistency by reusing previous high-quality outputs. Testing must ensure that the cached outputs are still appropriate and that the caching mechanism itself is working as expected.
- Quality and Consistency: Is the cached response always accurate and high quality? We check for output consistency and verify that the cached answer remains valid over time. Without these checks, the system may consistently serve responses that were only temporarily correct, or that contain subtly incorrect information which can spread rapidly and misinform users.
- Cache Invalidation: When does the system invalidate the cache to get a fresh model response? A well-tested invalidation policy ensures that users receive up-to-date information when the underlying knowledge or model capability has changed. The main risk here is cache staleness, whereby cached responses remain active even after the underlying data has changed, so the system provides outdated and incorrect answers.
In practice, testing CAG systems means validating both the quality of cached responses and the conditions under which they are refreshed. Consider an internal HR system that uses CAG to answer common employee questions, such as “What is the annual leave allowance?” Testing verifies that the cached response remains correct over time and is invalidated only when the underlying HR policy is updated, prompting a new authoritative answer to be generated and re-cached.
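A minimal sketch of the caching behaviour we would test is shown below. The cache keying, time-to-live, and policy-version tagging are illustrative assumptions rather than any specific product’s design.

```python
import time

# Illustrative cache keyed on the normalised query plus the version of the
# underlying policy document - both keying choices are assumptions for this sketch
class AnswerCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.entries = {}  # (query, policy_version) -> (answer, stored_at)

    def get(self, query: str, policy_version: str):
        key = (query.strip().lower(), policy_version)
        entry = self.entries.get(key)
        if entry is None:
            return None
        answer, stored_at = entry
        if time.time() - stored_at > self.ttl:  # stale entry: force a fresh generation
            del self.entries[key]
            return None
        return answer

    def put(self, query: str, policy_version: str, answer: str):
        self.entries[(query.strip().lower(), policy_version)] = (answer, time.time())

# Test: a policy update must invalidate the cached answer
cache = AnswerCache(ttl_seconds=3600)
cache.put("What is the annual leave allowance?", "policy-v1", "25 days per year")
assert cache.get("What is the annual leave allowance?", "policy-v1") == "25 days per year"
assert cache.get("What is the annual leave allowance?", "policy-v2") is None  # new policy, no stale reuse
```

In a design like this, the policy version would come from the document store’s metadata, so updating the HR policy automatically changes the cache key and forces a fresh, authoritative answer to be generated and re-cached.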
Testing KAG Systems
KAG systems integrate structured knowledge graphs or databases. The focus shifts to verifying the accuracy of the factual data and the system’s ability to interpret that structure. If we don’t test well, we risk the system delivering factually precise but contextually flawed information, which can lead to compliance or business errors.
- Factual Accuracy: Is the model correctly interpreting and using the structured data from the knowledge graph? A key quality gate is confirming the factual integrity of the resulting response. We need to ensure that the model isn’t misinterpreting complex or ambiguous data, introducing factual errors into summaries that then compromise business decisions.
- Knowledge Maintenance: How is the knowledge graph maintained and updated? Since structured data changes over time, we must have robust processes to ensure the model is always accessing the most current and accurate information. If we don’t actively manage our data, we risk data drift, where the system retrieves information that is technically relevant but factually incorrect, creating a false sense of confidence in responses that are fundamentally wrong.
For example, in a pharmaceutical research application, testing verifies that the model correctly retrieves precise numerical data from structured knowledge graphs (such as dosage levels or trial outcomes) and presents it accurately in narrative summaries.
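A crude but useful automated check, sketched below, extracts the numbers from a generated summary and compares them against the structured source. The knowledge-graph record and summary text are assumptions made up for illustration.

```python
import re

# Assumed structured record from the knowledge graph (illustrative values only)
trial_record = {"drug": "ExampleDrug", "dose_mg": 50, "response_rate_pct": 62.5}

# Assumed model-generated narrative summary of that record
summary = "At a 50 mg dose, ExampleDrug showed a response rate of 62.5% in the trial."

# Factual accuracy: every numeric fact in the record must appear in the summary,
# and the summary must not introduce numbers that are not in the record
numbers_in_summary = {float(n) for n in re.findall(r"\d+(?:\.\d+)?", summary)}
numbers_in_record = {float(v) for v in trial_record.values() if isinstance(v, (int, float))}

assert numbers_in_record <= numbers_in_summary, "Summary is missing facts from the knowledge graph"
assert numbers_in_summary <= numbers_in_record, "Summary contains numbers not grounded in the record"
print("Numeric facts in the summary match the structured source")
```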
Operational QA and Monitoring
Testing an AI system is not a one-off event; it’s a continuous process that extends beyond the initial deployment. Since real-world data and user behaviour constantly change, the QA focus shifts to ongoing validation and maintenance.
- Monitoring in Production: This involves setting up robust systems to track the model’s performance in a live environment. We continuously monitor for key indicators such as a sudden drop in model accuracy, data drift (where new live data starts to diverge significantly from the training data), and the potential emergence of bias in production outputs. Early detection of these changes is crucial for preventing system failure – a minimal drift-check sketch follows this list.
- Feedback Loops: A vital part of operational QA is establishing clear channels for user feedback. We need a system for gathering, analysing, and prioritising this feedback, then using it systematically to inform model retraining and updates to the knowledge bases. This ensures the AI system constantly learns and remains relevant.
- The QA Team’s Evolving Role: The role of the QA team becomes less about running fixed test scripts and more about continuous monitoring, validation, and system maintenance. It transforms into an MLOps (Machine Learning Operations) focus, where quality assurance is integrated into the entire lifecycle, ensuring that the system is not only correct on day one but remains reliable and safe indefinitely.
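As one concrete example of production monitoring, the sketch below uses a two-sample Kolmogorov-Smirnov test to flag when a live feature’s distribution drifts away from the training data. The feature values and alert threshold are assumptions for illustration; real monitoring would cover many features and richer signals.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Assumed reference (training-time) and live distributions for one input feature
training_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=1000)  # live data has shifted

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the live data
# no longer looks like the data the model was trained on
statistic, p_value = ks_2samp(training_feature, live_feature)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")

ALERT_THRESHOLD = 0.01  # assumed significance level for raising a drift alert
if p_value < ALERT_THRESHOLD:
    print("Data drift detected - trigger investigation and possible retraining")
```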
Conclusion
Effective testing of AI systems requires a combination of established software quality assurance principles and a more specialised, continuous approach. Ultimately, quality is achieved through robust checks across three critical areas: the source data, the core model, and the system’s augmentation architecture (such as RAG, KAG, and similar approaches).
As AI evolves at pace, our approach to quality must evolve with it. At Nimble Approach, we embed this continuous focus on quality throughout the entire lifecycle, ensuring the AI systems we deliver are not only innovative, but consistently reliable, safe, and fair.
Looking for support in assuring the quality of your AI systems? Reach out to our team today.