By Akthar Miah, Data Engineer at Nimble Approach
This blog provides a technical walkthrough of how to build an end-to-end Master Data Management (MDM) process using Python. It demonstrates how to transform messy, multi-source CRM data into a single trusted master record through matching, clustering, and survivorship logic.
Consider this common scenario: A large enterprise has multiple CRM systems deployed across various departments – sales uses one platform, marketing another, customer service a third, and perhaps regional offices maintain their own localised systems. Each system captures similar but slightly different information, updated at different frequencies, with varying levels of data quality and completeness. When the time comes to generate reports or perform analytics, the organisation faces a critical question: which system contains the “true” data? Is “Innovate Ltd” in one system the same as “Innovate Limited” in another?
This is where Master Data Management (MDM) comes in.
Master Data Management (MDM) is the discipline of creating one single, authoritative master record for each key data entity – often called a “golden record” or “master record”. We will stick to the term “master record” throughout the rest of this article, to avoid confusion with the Gold layer of the medallion architecture widely used in the data space.
To make this process concrete, we’ll walk through a Python-based MDM workflow that incrementally shapes raw, unified CRM data into trusted master records – covering standardisation, record linkage, weighted matching, clustering, and survivorship. Each step builds logically on the previous one, creating a clear narrative from ingestion to final mastered output.
The Big Picture: Where MDM Fits in Your Data Warehouse
Before we dive into the code, let’s understand where this process sits in the grand scheme of a modern data warehouse. The script we’re analysing assumes the initial, challenging steps of extracting data from source systems and unifying them into a single structure have already been completed. This is often handled in the Bronze (raw) and Silver (standardised, cleaned) layers of a data lakehouse.
For demonstration purposes, we’ve created synthetic datasets that replicate real-world data challenges. As an example, we’ve simulated scenarios where “Innovate Ltd” appears in Source System A while “Innovate Limited” exists in Source System B, representing the same company but with slight naming variations across different CRM platforms.
We will continue operating in the Silver layer for the MDM process. The MDM stage (“MDM Script Record Linkage”) is the penultimate step before the data is dimensionally modelled, ready for the Gold layer.
The resulting clean data is used to build a mastered dimension layer (e.g. a DimCompany table). Because our dimensions are now built from mastered data, all business facts, such as sales or support tickets, link back to a single, reliable company record, eliminating duplicate reporting and providing a true 360-degree view.
A Step-by-Step Guide to the MDM Script
Our script begins its work on unified_companies.csv, a file where company records from different systems have been brought together. What follows is a sequential, narrative walkthrough of how this raw file is transformed into trusted master records.
Step 1: Loading, Cleaning, and Trust Scoring
The first job is to load the data and prepare it for comparison.
Loading: The script loads the unified_companies.csv file into a pandas DataFrame.
Standardising: To compare text fields effectively, consistency is key. All text in key columns like company_name and headquarters_street is converted to lowercase and stripped of leading/trailing whitespace.
cols_to_standardize = ['company_name', 'headquarters_street', 'headquarters_city', 'headquarters_county', 'headquarters_country', 'industry']
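Applied to a pandas DataFrame (assumed here to be called df), that standardisation could look something like this sketch:
# Lowercase and trim each key text column so equivalent values compare consistently
for col in cols_to_standardize:
    df[col] = df[col].str.lower().str.strip()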
Assigning Trust: Not all data sources are created equal. The script assigns a trust_score to each record based on its source system.
source_trust_weights = {
'manual': 1.0,
'sap': 0.90,
'salesforce': 0.85,
'dynamics': 0.80,
}
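To attach these weights to each record, one option is a simple lookup against the column identifying the originating system – assumed here to be called source_system, with unrecognised sources falling back to a deliberately low default:
# Map each record's source system to its trust weight; unknown sources get a cautious default of 0.5
df['trust_score'] = df['source_system'].str.lower().map(source_trust_weights).fillna(0.5)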
Step 2: Candidate Pair Generation (Indexing)
Before comparing records, we need to determine which pairs to evaluate. Comparing every record against every other record produces roughly n²/2 pairs for n records, which quickly becomes computationally expensive as the dataset grows.
indexer = recordlinkage.Index()
indexer.full()
candidate_links = indexer.index(df)
For small datasets, this full comparison is fine; larger pipelines should use blocking strategies for efficiency.
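As an illustration, a blocking index in recordlinkage only generates pairs that share a value in a chosen column – here headquarters_country, though the choice of blocking key is an assumption rather than part of the original script:
import recordlinkage

# Only pair up records registered in the same country,
# cutting the candidate set dramatically compared with a full index
indexer = recordlinkage.Index()
indexer.block('headquarters_country')
candidate_links = indexer.index(df)
Blocking trades a little recall (true matches that disagree on the blocking key are never compared) for a large reduction in compute.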
Step 3: Feature Engineering and Comparison
This step compares attributes between candidate record pairs using algorithms suited to each data type:
- Company Name: jarowinkler
- Street & Industry: damerau_levenshtein
- City, County, Country, Postcode: exact match
compare_cl = recordlinkage.Compare()
...
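A sketch of what the elided comparison set-up could look like is below; the column names (including headquarters_postcode) are assumptions based on the fields listed above, not the script's exact configuration:
# Fuzzy string similarity for fields prone to spelling variations ("innovate ltd" vs "innovate limited")
compare_cl.string('company_name', 'company_name', method='jarowinkler', label='company_name')
compare_cl.string('headquarters_street', 'headquarters_street', method='damerau_levenshtein', label='headquarters_street')
compare_cl.string('industry', 'industry', method='damerau_levenshtein', label='industry')

# Exact agreement for fields where partial similarity is not meaningful
compare_cl.exact('headquarters_city', 'headquarters_city', label='headquarters_city')
compare_cl.exact('headquarters_county', 'headquarters_county', label='headquarters_county')
compare_cl.exact('headquarters_country', 'headquarters_country', label='headquarters_country')
compare_cl.exact('headquarters_postcode', 'headquarters_postcode', label='headquarters_postcode')

# One row of per-field similarity scores for every candidate pair
features = compare_cl.compute(candidate_links, df)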
Each comparison produces a similarity or match score.
Step 4: Weighted Scoring and Match Classification
Weighted scoring assigns importance to different features and computes a final match score.
feature_weights = { ... }
weighted_scores = features.dot(pd.Series(feature_weights))
match_threshold = 1.3
Records above the threshold are classified as matches.
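The elided weights might look something like the sketch below; the individual values are purely illustrative (they are not the script's real weights), but they show how agreement on the company name can be made to count for more than, say, agreement on industry:
# Illustrative weights only: company name similarity carries the most influence
feature_weights = {
    'company_name': 0.5,
    'headquarters_street': 0.2,
    'industry': 0.1,
    'headquarters_city': 0.2,
    'headquarters_county': 0.1,
    'headquarters_country': 0.2,
    'headquarters_postcode': 0.2,
}

# Keep only the candidate pairs whose weighted score clears the threshold
match_indices = weighted_scores[weighted_scores >= match_threshold].index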
Step 5: Clustering Connected Matches
Matched pairs are grouped into clusters representing unique real-world entities.
G = nx.from_edgelist(match_indices.to_list())
clusters = list(nx.connected_components(G))
If A matches B and B matches C, they form a single cluster.
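A small sketch of how each record could then be tagged with its cluster (variable names assumed) might be:
# Assign every record a cluster id; records that matched nothing become their own singleton cluster
record_to_cluster = {}
for cluster_id, members in enumerate(clusters):
    for record_idx in members:
        record_to_cluster[record_idx] = cluster_id

df['cluster_id'] = [record_to_cluster.get(idx, f'singleton_{idx}') for idx in df.index]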
Step 6: Survivorship and Manual Overrides
Survivorship determines the final master record for each cluster.
Automated: The highest trust_score wins.
Manual overrides: Data stewards can enforce exceptions.
manual_override_exceptions = { ... }
This hybrid approach ensures both accuracy and governance.
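One way to sketch that survivorship logic (the column names, and keying the overrides by cluster id, are assumptions for illustration):
# For each cluster, the record with the highest trust_score survives as the master;
# a data steward can force a different winner for a specific cluster via the override map
master_ids = {}
for cluster_id, members in enumerate(clusters):
    winner = df.loc[list(members), 'trust_score'].idxmax()
    winner = manual_override_exceptions.get(cluster_id, winner)
    for record_idx in members:
        master_ids[record_idx] = winner

# Unmatched records simply map to themselves as their own master
df['mapped_to_master_id'] = [master_ids.get(idx, idx) for idx in df.index]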
The Final Output: The Survivorship View
The final DataFrame contains:
- mapped_to_master_id
- match_reason
- match_score
- trust_score
This becomes the foundation for building DimCompany with clean mastered records.
Why It Matters: Accurate Analytics
With duplicate entities removed and mastered, reporting becomes clear and consistent:
- One company
- One identity
- One set of facts
Business users can finally trust the numbers.
Conclusion
This Python-driven MDM workflow demonstrates how raw, inconsistent CRM data can be systematically transformed into trusted, analytics-ready master records. By progressing through standardisation, indexing, comparison, weighted scoring, clustering, and survivorship, the process creates a coherent pipeline that produces a reliable single source of truth.
A strong MDM foundation ensures that downstream dashboards, metrics, and business decisions are based on accurate and unified data – not fragmented system outputs. With the right technical approach, organisations can eliminate duplication, strengthen governance, and fully unlock the value of their customer and company data.