🏃💡 Six-Pack Series Ⅵ: The Unsung Heroes of Data – Slowly Changing Dimensions

The Role of SCDs in Building Scalable and Reliable Data Systems

Jan 08, 2025

"If your data strategy doesn't account for change, you might as well be writing your business plan in permanent marker."

Look, we need to talk about Slowly Changing Dimensions (SCDs). Not because they're sexy – they're about as glamorous as a database administrator's coffee mug – but because they're the difference between a data architecture that scales and one that collapses faster than a tech startup's valuation after a failed demo.

Dimensions are the building blocks of data systems, defining ‘who,’ ‘what,’ or ‘where.’ Slowly Changing Dimensions (SCDs) take it further by tracking how these change over time—an essential feature for AI and analytics

Hang on, Why are these called Slowly Changing Dimensions again?

First, let's clear up some terminology. We call them "Slowly" Changing Dimensions, but that's a bit of a misnomer. Some dimensions change rapidly (think stock prices), others glacially (like geography codes), and some barely at all. What matters isn't the speed – it's how we track and manage those changes.

At their core, dimensions are just nouns – people, places, and things. They're the descriptive backbone of your data model that give meaning to your facts. Without them, you're just counting numbers without context.

The Hard Truth About Dimensions

Let's cut through the jargon. A dimension is just descriptive data about something in your business – customers, products, locations, whatever keeps your CFO up at night. But here's the kicker: this stuff changes. Constantly. Customers move houses, products get rebranded, and your carefully crafted data model starts looking like a teenager's bedroom if you don't plan for it.

Want to know the most expensive words in data architecture? "We'll just put everything in one big table." Sure, and I'll just store all my money under my mattress – both are great ideas until you actually need to use them.

Why SCDs Matter More Than Ever

Listen, if you're building anything with AI, machine learning, or even basic analytics, you're going to need SCDs. It's like building a house – you can skip the foundation and go straight to picking out curtains, but don't be surprised when the whole thing falls apart.

Here's why they're non-negotiable in 2025:

AI Isn't Magic Your fancy LLM needs clean, contextual, historical data. Without SCDs, you're feeding it the data equivalent of fast food and expecting Michelin-star results.
Compliance Isn't Optional GDPR, AI Act, privacy regulations – they all demand knowing who knew what, when. Try explaining to auditors why you can't tell them what a customer's status was six months ago. I'll wait.
Real-time Isn't Really Real-time Without proper SCD management, your "real-time" analytics are about as current as last week's newspaper.

The SCD Toolbox: Choosing the Right Approach

Selecting the right Slowly Changing Dimension (SCD) type is all about balancing your use case, complexity, and resource considerations. Here’s a clear and practical breakdown:

#SCD Types and Their Use Cases

How to Decide

Start Simple: If historical tracking isn’t critical, Type 1 may suffice.
Plan for Growth: If you anticipate future need for historical data, Type 2 or Type 6 should be considered.
Optimize for Performance: For high-speed, real-time requirements, Type 7 is invaluable.

The choice ultimately depends on your use case, the importance of historical data, and the complexity your system can handle. By understanding these types, you can design dimensions that align with your organization’s goals and future-proof your data strategy.

Practical Implementation Tips (Because Theory Doesn't Pay the Bills)

Start Simple Don't try to boil the ocean. Pick one critical dimension and get it right. It's like eating an elephant – one bite at a time, and yes, I know that's a horrible metaphor.
Automate or Die Manual SCD management is like trying to herd cats – technically possible but a terrible use of everyone's time. Think burn out, attrition… not fun things!
Performance Matters
- Partition your historic data like you partition your problems – into manageable chunks
- Keep frequently accessed data in-memory (Type 7) if you can afford it
- Index smartly, unless you enjoy watching query times that rival geological epochs.

The Hard Truth About Retrofitting

Here's something they don't teach you in data modeling courses: retrofitting SCDs into an existing architecture is like trying to replace your house's foundation while still living in it. Technically possible? Yes. Practically feasible? About as likely as getting universal agreement on tabs versus spaces, or sheets versus tabs. Oh, and it can be pricey.

The reason is simple but brutal: SCD implementation affects everything:

Your data loading processes
Your primary and foreign key relationships
Your reporting logic
Your downstream analytics, bi & ai
Your metadata management
Your backup and recovery procedures
And quite a few others but I think you get the point.

Before we show the technical bit that you need to understand, here’s a clearer, non-technical explanation:

How a Type 2 Slowly Changing Dimension Works (In Plain English)

Imagine you run a business, and you want to keep track of every time a customer updates their information—like changing their address. Instead of just replacing the old address with the new one, you keep both. This lets you see not only where the customer lives now but also where they used to live, and when those changes happened.

A Type 2 Slowly Changing Dimension (SCD) is a way of organizing your data so it keeps that history. Here’s how it works step-by-step:

How It’s Set Up

You start with a customer table that includes:

A unique identifier for each customer (called a "CustomerKey").
Their current details like name and address.
Two dates: ValidFrom (when this information started being true) and ValidTo (when it stopped being true).
A marker called Current that tells you which row is the most up-to-date.
A version number (EffectiveVersion) that helps you track each change.

What Happens When a Customer Updates Their Information?

Find the Current Data: When the customer updates their address, the system looks for the row in the table where the customer’s data is marked as "Current."
Close the Old Row: The system adds an end date (ValidTo) to the old information and marks it as no longer current (Current = 0).
Insert the New Row: A new row is added for the customer, containing their updated information. It gets:
- Today’s date as the ValidFrom.
- A placeholder future date (e.g., "9999-12-31") as ValidTo to show it’s the current record.
- A version number one higher than the last version.

Why This Matters

Preserving History: You can see all the addresses the customer has ever had and when they changed. This is invaluable for analyzing trends or resolving disputes.
Improved Decision-Making: With a clear record of changes, your marketing team might notice that customers who move frequently spend more, or your operations team might see patterns in service issues by location.
Auditability: If someone asks, "What address did this customer give us when they signed up for a service two years ago?" you’ll have the answer.

Real-World Example

Let’s say your customer John Smith originally lived at 123 Emerald Street. Six months later, he moves to 456 Foxy Avenue,what is the impact? To help you understand, here’s a practical example of how a customer’s address changes might be handled in a Type 2 SCD

Why Businesses Love This Approach

Type 2 SCDs give you the best of both worlds:

Current Data: You always know the customer’s latest details.
Historical Data: You can analyze past trends and behaviors without losing older information.

This approach ensures your business operates with both accuracy and a deep understanding of how things change over time. Whether you’re managing customer data, product prices, or employee roles, Type 2 SCDs are a powerful way to stay organized.

-- A typical Type 2 SCD structure

This isn't just about storing data – it's about maintaining referential integrity across time. Every fact table that references this dimension needs to know exactly which version of the customer it's pointing to. Get this wrong, and your historical analysis becomes about as reliable as a weather forecast.

Why SCDs Matter More Than Ever: Modern Performance Considerations

A critical point for 2025: With the rise of vector databases and feature stores, your SCD strategy needs to account for new performance demands. Here's what you need to consider:

# Example: Managing SCD in a feature store

Cloud Implications: The Bill Nobody Talks About

Let's talk cloud economics, because your CFO certainly will. Running SCDs in the cloud isn't just a technical decision – it's a financial one.

Platform-Specific Considerations:

Snowflake: Leverage micro-partitioning for SCD Type 2. Cost per TB: $23-$40. Optimize your clustering keys based on your query patterns.
Redshift: Use distribution and sort keys aligned with your SCD strategy. Consider using dist key on current_flag for Type 2.
BigQuery: Partition by ValidFrom dates and cluster on business keys. Watch those partition scans – they're not free.

Implementation tip: Don't just copy your on-prem SCD strategy to the cloud. A typical Type 2 dimension that costs $1K/month on-prem could easily hit $5K in cloud storage without proper optimization.

Note: The claims about cloud implications are estimates and cloud providers do change pricing structures. However, the actual costs depend on the data volume, update frequency, and how well the architecture is optimized for the cloud platform. Properly leveraging the unique features of each platform can mitigate much of the cost escalation highlighted in the example.

The AI/ML Reality Check

Your machine learning pipeline is only as good as its training data. Here's how SCDs impact AI initiatives:

# Real-world example of temporal consistency in ML

Without proper SCD management:

Training data becomes temporally inconsistent
Model drift increases
Feature store performance tanks
Your data scientists start sending you passive-aggressive Slack messages

Real-World Cost Impact

Let's talk numbers:

*Based on a typical 10M row dimension with 5-year history in a busy table.

Decision Framework:

Business value of historical tracking
Regulatory requirements
Query patterns (historical vs. current state)
Storage budget
Development resources

The Compliance Angle

With the EU AI Act and similar global regulations & general awareness hitting, SCDs aren't just nice to have – they're your audit trail. You need to prove:

Data lineage
Change history
Point-in-time accuracy
Feature evolution

Getting this wrong isn't just a technical problem – it's a board-level issue. Your SCD strategy needs to stand up to regulatory scrutiny.

Bottom Line: Your Action Plan

#Your SCD Action Plan

Common Disasters and How to Dodge Them

The "We'll Fix It Later" Trap Spoiler alert: Later never comes, and technical debt compounds faster than a loan shark's interest.
The Tool Trap That shiny new data tool promising automatic SCD handling? Read the fine print. Then read it again.
The Complexity Trap You don't need every type of SCD. You need the ones that solve your actual problems.

The Bottom Line

SCDs aren't optional anymore. They're as fundamental to modern data architecture as coffee is to developers. Without them, you're building on quicksand and hoping for the best.

Your data strategy is your company's DNA – and SCDs are how you manage mutations without creating monsters. Get them right, and you've got a foundation that'll support whatever AI, analytics, or regulatory compliance curve balls come your way.

Remember: Good data architecture isn't about following trends – it's about building systems that scale with your business while keeping your auditors happy and your budgets intact.

Want more straight-talking data architecture insights? Follow me on LinkedIn or subscribe to my Substack. No fluff, no vendor pitch, just battle-tested wisdom from someone who's seen every data disaster you're trying to avoid.

The Practical Data Architect

Discussion about this post