
By George Verney, Head of the Data Capability at Nimble Approach.

What is Test-Driven Development (TDD)?

Before we dive in – what is this thing?

Test-Driven Development (TDD) is a software development approach in which tests are written before the actual code.
By writing tests first, developers are forced to think about the requirements and functionality they are implementing, leading to more robust and reliable code.

I find that TDD is a little alien to most Data Engineers – and underused by most software engineers as well!

I started as a skeptic, too. Particularly when wearing my “data hat” 🥳

But the landscape of data engineering is changing, and I think the broad adoption of languages such as Python, R and Scala for our work at scale is placing more and more emphasis on the engineering aspect of our roles.

How I hope you feel by the end of this article 🤩

A Simple Example of TDD to Illustrate

Your Product Owner brings you some feature requirements for everyone’s favourite subject: dogs! 🐶

Dogs are wonderful. They come in all shapes and sizes. Tiny pups are still learning to howl, older dogs like to boof under their breath, other dogs simply woof, while others are quiet.

So you start to think about modelling your dogs as an object. In this exercise we’ll be using Python.

# dog.py
from dataclasses import dataclass

@dataclass
class Doggo:
    name: str
    age: int

    @property
    def sound(self) -> str:
        if self.age <= 1:
            return "Awoo!"
        if self.age > 8:
            return "Boof."
        return "Woof"

And because we’re good engineers, we’re going to write some tests for our Doggo object…

# test_dog.py
from dog import Doggo

def test_doggo_sound_puppy():
    expected = "Awoo!"
    actual = Doggo(name="Layla", age=1).sound
    assert expected == actual, "Puppies howl cutely!"

def test_doggo_sound_senior():
    expected = "Boof."
    actual = Doggo(name="Charlie", age=14).sound
    assert expected == actual, "Older dogs boof under the breath"

def test_doggo_sound_default():
    expected = "Woof"
    actual = Doggo(name="Bob", age=7).sound
    assert expected == actual, "Your average doggo woofs"

100% test coverage – aren’t we amazing! 🤘 Smash this into a pull request and bask in the glory of another perfect lump of code… right? Close.

What’s wrong? All my tests pass

True. All your tests pass. Your code has 100% coverage. Your peers find no fault during review and you get the code merged without issue.

But you’ve missed something in your rush towards perfection…

Let’s revisit the requirements:

Dogs are wonderful. They come in all shapes and sizes. Tiny pups are still learning to howl, older dogs like to boof under their breath, other dogs simply woof, while others are quiet.

I was so blinded by writing code about my favourite subject – one I’m an expert in – that I missed something 😳

For instance: did you know that Basenjis are nicknamed the “barkless dog”?

Let’s try again, using TDD

If we shift our mindset to writing the tests first, we are much more likely to consider all the requirements.

Here we go again – this time starting with a “stub” class, then the tests before the implementation code.

# dog.py
class Doggo:
    @property
    def sound(self) -> str:
        pass

# test_dog.py
from unittest import mock

from dog import Doggo

def test_doggo_sound_puppy():
    expected = "Awoo!"
    actual = Doggo(name="Layla", age=1).sound
    assert expected == actual, "Puppies howl cutely!"

def test_doggo_sound_senior():
    expected = "Boof."
    actual = Doggo(name="Charlie", age=14).sound
    assert expected == actual, "Older dogs boof under the breath"

def test_doggo_sound_default():
    expected = "Woof"
    actual = Doggo(name="Bob", age=7).sound
    assert expected == actual, "Your average doggo woofs"

def test_doggo_sound_quiet():
    expected = None
    actual = Doggo(name=mock.ANY, age=mock.ANY).sound
    assert expected == actual, "Some dogs don't make noise"

Note the last test case. We don’t know how we’re going to implement it yet, but we know that it’s a scenario we need to cover.

Obviously none of these pass right now – but once they all do, we know our doggos are behaving correctly!

Here’s one potential approach:

# dog.py
from dataclasses import dataclass

@dataclass
class Doggo:
    name: str
    age: int
    is_mute: bool = False

    @property
    def sound(self) -> str | None:
        if self.is_mute:
            return None
        if self.age <= 1:
            return "Awoo!"
        if self.age > 8:
            return "Boof."
        return "Woof"

This requires a minor update to the final test case, namely adding a new keyword argument to that instance:

def test_doggo_sound_quiet():
    expected = None
    actual = Doggo(name=mock.ANY, age=mock.ANY, is_mute=True).sound
    assert expected == actual, "Some dogs don't make noise"

So why should you care about TDD? Lessons learned

The biggest shift for me wasn’t technical – it was mindset.

Prioritising your test cases first will help you:

  • Build the right thing first time round (led by the acceptance criteria)
  • Only write the code that is needed (avoid rabbit holes on features not in the spec)
  • Focus on a modular design (encourages simpler-to-test code in a workflow that narrows focus to each test case)
  • Increase code quality (helps to follow the K.I.S.S. principle and gives you “self-documenting” code via the test cases)
  • Reduce risk of regressions (test coverage FTW!)
  • Save you time and effort on refactoring (your existing tests protect against regressions during refactoring – see the sketch below)
  • And most importantly: still allow you to write and ship awesome code! (don’t forget that “test code” is still code… you just might not love it yet)

The end result is that your users will be happier, as you’ve produced a rock-solid product that matches their asks! 🎉
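
To make that refactoring point concrete, here’s a hedged sketch of one way the sound property could later be rewritten using structural pattern matching (Python 3.10+). The behaviour is unchanged, so the existing test_dog.py suite should stay green without touching a single test:

# dog.py – same behaviour, different implementation.
from dataclasses import dataclass

@dataclass
class Doggo:
    name: str
    age: int
    is_mute: bool = False

    @property
    def sound(self) -> str | None:
        # The tests written earlier act as the safety net for this refactor.
        match (self.is_mute, self.age):
            case (True, _):
                return None
            case (_, age) if age <= 1:
                return "Awoo!"
            case (_, age) if age > 8:
                return "Boof."
            case _:
                return "Woof"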

Next Steps

Now you’re fully on-board, where do you go next?

Practice

  • A FizzBuzz coding kata is a classic problem you’ve probably come across before – but how about writing your tests first? (See the sketch after this list.)
  • Advent Of Code – a fun, annual, series of coding challenges that really lend themselves to TDD (they give you a bunch of test scenarios in the question!). The first puzzle is a great place to start.
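
To make that first kata concrete, here’s a minimal test-first sketch. The fizzbuzz module, function name and expected strings are assumptions for the exercise, not a fixed spec – write these first, watch them fail, then implement:

# test_fizzbuzz.py – written before any fizzbuzz() implementation exists.
# The module and function names here are hypothetical; adapt them to your kata.
from fizzbuzz import fizzbuzz

def test_multiples_of_three_return_fizz():
    assert fizzbuzz(3) == "Fizz"

def test_multiples_of_five_return_buzz():
    assert fizzbuzz(5) == "Buzz"

def test_multiples_of_both_return_fizzbuzz():
    assert fizzbuzz(15) == "FizzBuzz"

def test_other_numbers_come_back_as_strings():
    assert fizzbuzz(7) == "7"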

Python Libraries

  • Is it bad form to suggest RTFM? The Python standard library unittest module is probably my most visited resource for all Python development!
  • Despite the above, I’m a pytest-by-default user now, largely due to fixturing – see the sketch after this list.
  • Start reporting on your code coverage (coverage|pytest-cov) to know where you’re currently at.
  • Consider diff-cover for new work, ensuring your additions and changes have tests.
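
As a taster of why fixtures win me over, here’s a small sketch of how the Doggo tests might look in pytest. The fixture name and the parametrised cases are just one possible layout, not the “right” one:

# test_dog.py – pytest flavour.
import pytest

from dog import Doggo

@pytest.fixture
def senior_doggo() -> Doggo:
    # Build the object in one place; any test that asks for
    # `senior_doggo` receives a fresh instance.
    return Doggo(name="Charlie", age=14)

def test_doggo_sound_senior(senior_doggo):
    assert senior_doggo.sound == "Boof.", "Older dogs boof under the breath"

# Parametrisation keeps the age-based cases in a single test.
@pytest.mark.parametrize(
    ("age", "expected"),
    [(1, "Awoo!"), (7, "Woof"), (14, "Boof.")],
)
def test_doggo_sound_by_age(age, expected):
    assert Doggo(name="Layla", age=age).sound == expected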

Want to bring TDD practices into your data workflows?

At Nimble Approach, we help teams embed quality engineering at every stage, from pipelines to platforms.

👉 Get in touch with us to find out how we can support your team.