By Markus Albrecht, Head of Technology at Nimble Approach
This blog explores the growing gap between high code coverage and real software reliability as AI begins writing both code and tests. Using a practical experiment with GenAI-generated test suites, it shows how mutation testing uncovers hidden weaknesses and outlines a faster, human-guided workflow for ensuring trustworthy, production-ready software.
So you’ve written your unit tests and generated your coverage report. 95% coverage – impressive at first glance. But what does that number really mean? While it confirms that most lines of code were executed, coverage by itself does not demonstrate that the code behaved as intended.
That’s exactly the gap mutation testing fills. Think of it as “a test for your tests.” The concept isn’t new – it’s been around since the 1970s – but for decades it was considered too computationally expensive to be practical. After all, who wants to run their entire test suite thousands of times? Fortunately, modern test automation frameworks have become fast enough that mutation testing is finally a realistic, powerful tool we can put to use.
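To make the idea concrete, here is a minimal, hypothetical sketch (the function is illustrative, not part of the experiment below): a mutation tool makes one tiny change to your code and re-runs your tests; if none of them fail, the mutant "survives" and has exposed a blind spot.

```python
# Original (hypothetical) production code
def is_adult(age: int) -> bool:
    return age >= 18

# One mutant the tool might generate: ">=" becomes ">"
def is_adult_mutant(age: int) -> bool:
    return age > 18

# This test passes against BOTH versions, so the mutant survives --
# proof that the suite never exercises the boundary value 18.
def test_is_adult():
    assert is_adult(30) is True
    assert is_adult(5) is False
```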
The Experiment: Putting GenAI-Generated Tests to the Test
To evaluate this approach, we selected a Python module and asked a GenAI assistant to generate the initial unit tests. At first glance, the results appeared promising:
- All 12 generated tests passed.
- We had ~85% code coverage.
Encouraging metrics – at least on the surface. However, once we introduced a mutation testing tool, the picture changed significantly.
| Metric | Initial (GenAI-Generated) | Final (Human-Refined) |
| --- | --- | --- |
| Code Coverage | ~85% | 100% |
| Mutation Kill Rate | 57.3% | 80.4% |
| Test Count | 12 | 33 |
Our mutation kill rate was a dismal 57.3%. In other words, nearly 43% of the simple, single-fault bugs (mutants) intentionally introduced into the code went undetected. The tests passed, but in many cases, they failed to provide meaningful validation of the underlying logic.
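For reference, the kill rate (sometimes called the mutation score) is simply the proportion of generated mutants that at least one test catches:

$$\text{kill rate} = \frac{\text{mutants killed}}{\text{total mutants generated}} \times 100\%$$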
Our Lessons: Why GenAI-Generated Tests Failed
So what did we learn? GenAI-generated tests are “optimistically correct”: they lack “adversarial thinking” and don’t approach the code the way a real, sceptical developer would. They systematically failed in a few key areas.
1. GenAI Checks State, Not Behaviour
This was the big one. The AI-generated tests were great at checking return values (state) but completely ignored interactions (behaviour).
- The GenAI Test: assert result == expected_dictionary.
- The Problem: This test would pass even if the function never bothered to call the mock it was given.
- The Fix: We had to manually add the assertion that actually matters: that the mock was called. A sketch of the pattern follows below.
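Here is a minimal sketch of the difference using pytest and unittest.mock – the function and variable names are hypothetical, but the pattern mirrors the fix we applied:

```python
from unittest.mock import Mock

# Hypothetical function under test: looks an employee up via a directory client.
def find_employee(directory, employee_id):
    record = directory.get(employee_id)
    return {"id": employee_id, "name": record["name"]} if record else None

# State-only test (the GenAI pattern): only the return value is asserted,
# so nothing here verifies HOW the function interacted with its collaborator.
def test_find_employee_state_only():
    directory = Mock()
    directory.get.return_value = {"name": "Ada"}
    assert find_employee(directory, 42) == {"id": 42, "name": "Ada"}

# Behavioural test (the fix): also asserts the collaborator was called,
# exactly once, with the expected argument.
def test_find_employee_checks_interaction():
    directory = Mock()
    directory.get.return_value = {"name": "Ada"}
    assert find_employee(directory, 42) == {"id": 42, "name": "Ada"}
    directory.get.assert_called_once_with(42)
```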
2. Logging and Counters Are Ignored
This was the single largest category of survivors, making up nearly 45% of our initial gaps. The GenAI never thought to use pytest’s caplog fixture to check the logging output.
- The Problem: A critical mutation that changed count += 1 to count = 1 (a nasty bug!) sailed right on through. Why? Because no test was actually reading the INFO log message “Found 2 employees…” to check that the ‘2’ was correct.
- The Fix: We had to add specific tests using caplog to assert that the correct log messages – and the correct counts within them – were being generated. One such test is sketched below.
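Here is the kind of caplog test we added – again with hypothetical names, but the same shape as the real fix:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical function under test: filters employees and logs how many it found.
def active_employees(employees):
    found = []
    count = 0
    for employee in employees:
        if employee.get("active"):
            found.append(employee)
            count += 1  # the mutant changes this to: count = 1
    logger.info("Found %d active employees", count)
    return found

# Asserting only on the return value lets the "count = 1" mutant survive;
# asserting on the logged message kills it.
def test_active_employees_logs_correct_count(caplog):
    employees = [{"active": True}, {"active": True}, {"active": False}]
    with caplog.at_level(logging.INFO):
        result = active_employees(employees)
    assert len(result) == 2
    assert "Found 2 active employees" in caplog.text
```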
3. GenAI Misses Adversarial Edge Cases
GenAI is brilliant at testing the “happy path”. It is terrible at thinking like a sceptic.
- The Problem: The AI wrote tests for empty strings (“”) but just… forgot about None values. It didn’t test for an empty dictionary ({}) versus a dictionary that was missing a required key ({“other_key”: “value”}).
- The Fix: We had to manually add all those annoying, fiddly, essential tests for None values, missing keys, and logical boundary conditions – like the spot where a mutant that changed an or to an and had survived. The sketch below shows the pattern.
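A sketch of those “fiddly” tests, with hypothetical names – a parametrised pytest case covering None, an empty dictionary, a missing key, and empty values:

```python
import pytest

# Hypothetical function mirroring the gaps we closed: it must cope with None,
# an empty dict, a dict missing the key it needs, and empty/None values.
def extract_email(record):
    if not record or "email" not in record:
        return None
    email = record["email"]
    return email.strip().lower() if email else None

@pytest.mark.parametrize(
    "record",
    [None, {}, {"other_key": "value"}, {"email": None}, {"email": ""}],
)
def test_extract_email_handles_awkward_inputs(record):
    assert extract_email(record) is None

def test_extract_email_normalises_a_valid_address():
    assert extract_email({"email": "  Ada@Example.COM "}) == "ada@example.com"
```

Note that the parametrised cases also kill the or-to-and mutant: with and in place of or, the None input raises a TypeError and the missing-key input raises a KeyError, so those tests fail and the mutant dies.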
Why This is Non-Negotiable in the GenAI Era
Generative AI coding assistants are productivity superchargers. They write vast quantities of code and tests at a speed we’ve never seen before. But that new velocity introduces two massive risks.
First, it creates a flood of code for human review. Teams are left with an enormous amount of new code to validate. This quickly leads to code review fatigue – a very real problem. When you’re overwhelmed by the sheer quantity, are you really spotting that a test is checking state but not behaviour? It’s inevitable that things will be missed.
Second, AI is phenomenal at generating plausible-looking code. As our experiment showed, it can write a test that gives 85% coverage but has a 57% kill rate.
This brings us to the 64-million-dollar question: If an AI writes your code and the same AI generates your tests, who is actually verifying the AI’s work – especially when human reviewers are already operating at capacity?
You cannot and should not just blindly trust the AI. You need an automated, objective umpire.
Mutation testing is that umpire.
In the GenAI-driven development lifecycle, mutation testing becomes the essential, automated quality-control mechanism. It’s the only process that can tirelessly and rigorously validate the effectiveness of the tests your AI partner is generating. It moves us from “Did the AI write a test?” to “Did the AI write a good test?”
The New Workflow: Using Mutation Gaps to Guide GenAI
Here is the most significant insight: mutation testing should not merely highlight deficiencies in your test suite – it should guide the GenAI in improving them.
Previously, our workflow involved manually writing tests. Now, with AI-driven support, the process is far more efficient and iterative:
- GenAI First Pass: An engineer prompts GenAI to “write a full test suite” for the module. This gets us our 12 tests and that rubbish 57.3% kill rate.
- Run mutmut (other frameworks are available): We run mutation testing and get a clear, actionable ‘to-do’ list of 78 surviving mutants.
- Human-Guided GenAI: Now, the engineer plays ‘art director’. They don’t manually write the 21 missing tests. Instead, they use the mutation testing report to write new, highly specific prompts for the GenAI:
- “Write a test using caplog that asserts the log message ‘X employees missing’ contains the correct count.”
- “Write a test for get_email_by_email that fails if the get_employee_directory mock is not called exactly once.”
- “Write a test that passes a None value for the email field and asserts the function handles it.”
This “human-in-the-loop” refinement is incredibly fast. It’s how we efficiently grew the suite from 12 to 33 high-quality tests and raised our kill rate from 57.3% to a production-ready 80.4%.
The Real ROI: Why This is Finally Viable
This brings us to the most important point. For 50 years, mutation testing was considered an expensive, academic luxury. GenAI, paradoxically, makes it an affordable necessity.
Let’s have a look at the maths:
- Traditional Test-Driven Development (TDD): 8 hours of manual coding + 2-3 hours of more manual coding (to write 21 new tests by hand to fix mutants) = 10-12 hours total. That 2-3 hour validation step is a 25-30% time overhead, so teams were more likely to skip it – especially if deadlines were tight.
- GenAI-Assisted TDD: 3-4 hours of AI-assisted coding + 2-3 hours of human analysis & iteration = 5-7 hours total.
Crucially, those 2-3 hours are not spent coding. The AI does the typing in seconds. The time is re-allocated to high-value human work that only a developer can do:
- Analysis: A human must review the mutation testing report – perhaps with 78 surviving mutants – and identify the underlying patterns. The AI is not yet capable of performing this level of synthesis. A human can recognise, for example, that 25 relate to logging, 20 stem from missing assert_called_once, and 10 involve None versus “” edge cases. This requires analytical reasoning, not simply mechanical execution.
- Strategy & Iteration: A human then has to direct the AI. This is our “human-in-the-loop” process: analyse the gap, write a clever prompt, run the tool, check the score, and repeat.
Because GenAI reduces initial development time so significantly, this validation stage is no longer optional – it becomes an integral part of a workflow that remains 40–50% faster overall.
And that’s the real shift: if AI accelerates software delivery, we must adopt equally rigorous, automated methods to ensure its output is trustworthy. Relying on AI-generated tests without verifying their effectiveness is no more reliable than celebrating 100% code coverage.
As AI generates more and more of our software, the question is no longer “Is my code tested?” but “Are my tests strong enough to catch real defects?”
Mutation testing is how you get that answer – and how you ensure AI-driven development remains accountable, reliable, and safe.
Conclusion
The maths is undeniable: GenAI paired with mutation testing is the best way to achieve reliability without blowing your budget. But moving from a basic test run to a sophisticated, human-in-the-loop workflow is a serious engineering hurdle.
That’s where we can help.
We look past the vanity metrics of high code coverage. As experts in Quality Engineering and Test Automation, we help you architect the pipelines, configure the tools, and build the feedback loops that will make your systems robust.