By Akthar Miah, Data Engineer at Nimble Approach
This blog provides a technical walkthrough of how to build an end-to-end Master Data Management (MDM) process using Python. It demonstrates how to transform messy, multi-source CRM data into a single trusted master record through matching, clustering, and survivorship logic.
Consider this common scenario: A large enterprise has multiple CRM systems deployed across various departments – sales uses one platform, marketing another, customer service a third, and perhaps regional offices maintain their own localised systems. Each system captures similar but slightly different information, updated at different frequencies, with varying levels of data quality and completeness. When the time comes to generate reports or perform analytics, the organisation faces a critical question: which system contains the “true” data? Is “Innovate Ltd” in one system the same as “Innovate Limited” in another?
This is where Master Data Management (MDM) comes in.
Master Data Management (MDM) is the discipline of creating one single, authoritative master record for each key data entity – often called a “golden record” or “master record”. We will stick to the term “master record” throughout the rest of this article, to avoid confusion with the Gold layer of the medallion architecture widely used in the data space.
To make this process concrete, we’ll walk through a Python-based MDM workflow that incrementally shapes raw, unified CRM data into trusted master records – covering standardisation, record linkage, weighted matching, clustering, and survivorship. Each step builds logically on the previous one, creating a clear narrative from ingestion to final mastered output.
The Big Picture: Where MDM Fits in Your Data Warehouse
Before we dive into the code, let’s understand where this process sits in the grand scheme of a modern data warehouse. The script we’re analysing assumes the initial, challenging steps of extracting data from source systems and unifying them into a single structure have already been completed. This is often handled in the Bronze (raw) and Silver (standardised, cleaned) layers of a data lakehouse.
For demonstration purposes, we’ve created synthetic datasets that replicate real-world data challenges. As an example, we’ve simulated scenarios where “Innovate Ltd” appears in Source System A while “Innovate Limited” exists in Source System B, representing the same company but with slight naming variations across different CRM platforms.
We will continue operating in the Silver layer for the MDM process. The MDM stage (“MDM Script Record Linkage”) is the penultimate step before the data is dimensionally modelled, ready for the Gold layer.
The resulting clean data is used to build a mastered dimension layer (e.g. a DimCompany table). Because our dimensions are now built from mastered data, all business facts, such as sales or support tickets, link back to a single, reliable company record, eliminating duplicate reporting and providing a true 360-degree view.
A Step-by-Step Guide to the MDM Script
Our script begins its work on unified_companies.csv, a file where company records from different systems have been brought together. What follows is a sequential, narrative walkthrough of how this raw file is transformed into trusted master records.
Step 1: Loading, Cleaning, and Trust Scoring
The first job is to load the data and prepare it for comparison.
Loading: The script loads the unified_companies.csv file into a pandas DataFrame.
Standardising: To compare text fields effectively, consistency is key. All text in key columns like company_name and headquarters_street is converted to lowercase and stripped of leading/trailing whitespace.
cols_to_standardize = ['company_name', 'headquarters_street', 'headquarters_city', 'headquarters_county', 'headquarters_country', 'industry']
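Applied to a pandas DataFrame (assumed here to be called df), that standardisation could look something like this sketch:
# Lowercase and trim each key text column so equivalent values compare consistently
for col in cols_to_standardize:
    df[col] = df[col].str.lower().str.strip()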
Assigning Trust: Not all data sources are created equal. The script assigns a trust_score to each record based on its source system.
source_trust_weights = {
'manual': 1.0,
'sap': 0.90,
'salesforce': 0.85,
'dynamics': 0.80,
}
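To attach these weights to each record, one option is a simple lookup against the column identifying the originating system – assumed here to be called source_system, with unrecognised sources falling back to a deliberately low default:
# Map each record's source system to its trust weight; unknown sources get a cautious default of 0.5
df['trust_score'] = df['source_system'].str.lower().map(source_trust_weights).fillna(0.5)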
Step 2: Candidate Pair Generation (Indexing)
Before comparing records, we need to determine which pairs to evaluate. Comparing every record against every other record produces roughly n²/2 pairs for n records, which quickly becomes computationally expensive as the dataset grows.
indexer = recordlinkage.Index()
indexer.full()
candidate_links = indexer.index(df)
For small datasets, this full comparison is fine; larger pipelines should use blocking strategies for efficiency.
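As an illustration, a blocking index in recordlinkage only generates pairs that share a value in a chosen column – here headquarters_country, though the choice of blocking key is an assumption rather than part of the original script:
import recordlinkage

# Only pair up records registered in the same country,
# cutting the candidate set dramatically compared with a full index
indexer = recordlinkage.Index()
indexer.block('headquarters_country')
candidate_links = indexer.index(df)
Blocking trades a little recall (true matches that disagree on the blocking key are never compared) for a large reduction in compute.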
Step 3: Feature Engineering and Comparison
This step compares attributes between candidate record pairs using algorithms suited to each data type:
- Company Name: jarowinkler
- Street & Industry: damerau_levenshtein
- City, County, Country, Postcode: exact match
compare_cl = recordlinkage.Compare()
...
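A sketch of what the elided comparison set-up could look like is below; the column names (including headquarters_postcode) are assumptions based on the fields listed above, not the script's exact configuration:
# Fuzzy string similarity for fields prone to spelling variations ("innovate ltd" vs "innovate limited")
compare_cl.string('company_name', 'company_name', method='jarowinkler', label='company_name')
compare_cl.string('headquarters_street', 'headquarters_street', method='damerau_levenshtein', label='headquarters_street')
compare_cl.string('industry', 'industry', method='damerau_levenshtein', label='industry')

# Exact agreement for fields where partial similarity is not meaningful
compare_cl.exact('headquarters_city', 'headquarters_city', label='headquarters_city')
compare_cl.exact('headquarters_county', 'headquarters_county', label='headquarters_county')
compare_cl.exact('headquarters_country', 'headquarters_country', label='headquarters_country')
compare_cl.exact('headquarters_postcode', 'headquarters_postcode', label='headquarters_postcode')

# One row of per-field similarity scores for every candidate pair
features = compare_cl.compute(candidate_links, df)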
Each comparison produces a similarity or match score.
Step 4: Weighted Scoring and Match Classification
Weighted scoring assigns importance to different features and computes a final match score.
feature_weights = { ... }
weighted_scores = features.dot(pd.Series(feature_weights))
match_threshold = 1.3
Records above the threshold are classified as matches.
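The elided weights might look something like the sketch below; the individual values are purely illustrative (they are not the script's real weights), but they show how agreement on the company name can be made to count for more than, say, agreement on industry:
# Illustrative weights only: company name similarity carries the most influence
feature_weights = {
    'company_name': 0.5,
    'headquarters_street': 0.2,
    'industry': 0.1,
    'headquarters_city': 0.2,
    'headquarters_county': 0.1,
    'headquarters_country': 0.2,
    'headquarters_postcode': 0.2,
}

# Keep only the candidate pairs whose weighted score clears the threshold
match_indices = weighted_scores[weighted_scores >= match_threshold].index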
Step 5: Clustering Connected Matches
Matched pairs are grouped into clusters representing unique real-world entities.
G = nx.from_edgelist(match_indices.to_list())
clusters = list(nx.connected_components(G))
If A matches B and B matches C, they form a single cluster.
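A small sketch of how each record could then be tagged with its cluster (variable names assumed) might be:
# Assign every record a cluster id; records that matched nothing become their own singleton cluster
record_to_cluster = {}
for cluster_id, members in enumerate(clusters):
    for record_idx in members:
        record_to_cluster[record_idx] = cluster_id

df['cluster_id'] = [record_to_cluster.get(idx, f'singleton_{idx}') for idx in df.index]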
Step 6: Survivorship and Manual Overrides
Survivorship determines the final master record for each cluster.
Automated: The highest trust_score wins.
Manual overrides: Data stewards can enforce exceptions.
manual_override_exceptions = { ... }
This hybrid approach ensures both accuracy and governance.
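One way to sketch that survivorship logic (the column names, and keying the overrides by cluster id, are assumptions for illustration):
# For each cluster, the record with the highest trust_score survives as the master;
# a data steward can force a different winner for a specific cluster via the override map
master_ids = {}
for cluster_id, members in enumerate(clusters):
    winner = df.loc[list(members), 'trust_score'].idxmax()
    winner = manual_override_exceptions.get(cluster_id, winner)
    for record_idx in members:
        master_ids[record_idx] = winner

# Unmatched records simply map to themselves as their own master
df['mapped_to_master_id'] = [master_ids.get(idx, idx) for idx in df.index]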
The Final Output: The Survivorship View
The final DataFrame contains:
- mapped_to_master_id
- match_reason
- match_score
- trust_score
This becomes the foundation for building DimCompany with clean mastered records.
Why It Matters: Accurate Analytics
With duplicate entities removed and mastered, reporting becomes clear and consistent:
- One company
- One identity
- One set of facts
Business users can finally trust the numbers.
Conclusion
This Python-driven MDM workflow demonstrates how raw, inconsistent CRM data can be systematically transformed into trusted, analytics-ready master records. By progressing through standardisation, indexing, comparison, weighted scoring, clustering, and survivorship, the process creates a coherent pipeline that produces a reliable single source of truth.
A strong MDM foundation ensures that downstream dashboards, metrics, and business decisions are based on accurate and unified data – not fragmented system outputs. With the right technical approach, organisations can eliminate duplication, strengthen governance, and fully unlock the value of their customer and company data.