Intermediate · Lesson 5 of 16

Data and Model Versioning

Track which code, data, features, experiment, and model artifact produced every production release so you can debug, compare, and roll back safely.

🧒 Simple Explanation (ELI5)

If a cake turns out great, you need the exact recipe, ingredients, and oven settings to make it again. MLOps versioning records the equivalent for machine learning.

🔧 Why Do We Need It?

🌍 Real-world Analogy

An aircraft manufacturer tracks every serialized part in every plane. If one part fails, they know exactly which aircraft are affected. MLOps needs that same discipline for data and models.

⚙️ Technical Explanation

Versioning in MLOps spans multiple layers: Git for code, versioned data assets for training inputs, experiment tracking for run metadata, feature pipeline versions for transformations, model registries for approved artifacts, and deployment metadata for what is actually live. Without this chain, incident response becomes guesswork.

🔗 Tracking Example

A strong release chain looks like Git commit → dataset version → feature pipeline version → MLflow run → model registry version → deployed endpoint. That chain is what lets you separate data bugs from code bugs and release bugs.
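To make the chain concrete, here is a minimal sketch of one release record. The class name `ReleaseLineage` and its field names are illustrative assumptions, not part of MLflow or any registry API. The values reuse the identifiers from this lesson. The endpoint is left unset on purpose to show how a missing link surfaces.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ReleaseLineage:
    """One record per production release (hypothetical schema)."""
    git_commit: Optional[str] = None
    dataset_version: Optional[str] = None
    feature_pipeline_version: Optional[str] = None
    mlflow_run_id: Optional[str] = None
    model_version: Optional[str] = None
    endpoint: Optional[str] = None

    def missing_links(self) -> List[str]:
        """Names of chain links that were never recorded."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def chain(self) -> str:
        """Render the links in release order, marking gaps with '?'."""
        return " -> ".join(str(getattr(self, f.name) or "?") for f in fields(self))


release = ReleaseLineage(
    git_commit="8d4f2c1",
    dataset_version="churn-train:17",
    feature_pipeline_version="customer_features_v5",
    mlflow_run_id="284",
    model_version="churn-model:42",
    # endpoint intentionally omitted to demonstrate gap reporting
)
print(release.chain())
print(release.missing_links())
```

A frozen dataclass fits here because a lineage record should not change after the release ships.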

📊 Visual Representation

Lineage Chain: 🧾 Git Commit → 🗃️ Dataset v17 → 🧪 Run #284 → 📚 Model v42 → 🚀 Prod Endpoint

⌨️ Commands / Syntax

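bash
# Register explicitly versioned data and model assets in Azure ML (CLI v2).
# Assumes a default resource group and workspace are already configured.
az ml data create --name churn-train --version 17 --path ./data/train.parquet --type uri_file
az ml model create --name churn-model --version 42 --path ./outputs/model.pkl

# Tag the commit that produced the model so code and artifact share a release identifier.
git tag model-churn-v42
git push origin model-churn-v42

python
import mlflow

# Record each link of the lineage chain on the run that produced the model.
with mlflow.start_run(run_name="churn-v42"):
    mlflow.set_tag("git_commit", "8d4f2c1")  # commit the training code was run from
    mlflow.set_tag("dataset_version", "churn-train:17")  # matches the data asset registered above
    mlflow.log_param("feature_view", "customer_features_v5")
    mlflow.log_metric("auc", 0.862)
    mlflow.log_artifact("outputs/model.pkl")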
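The run above identifies the dataset by name and version only. One way to also detect data republished under the same identifier is to record a content hash next to it. The sketch below assumes that approach. The helper names `file_sha256` and `dataset_tags` and the `dataset_sha256` tag key are hypothetical, and the temporary file stands in for a real Parquet file.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_tags(name: str, version: int, path: Path) -> dict:
    """Build a tag dictionary that could be passed to mlflow.set_tags()."""
    return {
        "dataset_version": f"{name}:{version}",
        "dataset_sha256": file_sha256(path),
    }


# Demonstration with stand-in bytes (not real Parquet content).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "train.parquet"
    sample.write_bytes(b"original training rows")
    tags = dataset_tags("churn-train", 17, sample)

    # Simulate someone overwriting the file without bumping the version.
    sample.write_bytes(b"republished rows under the same path")
    changed = dataset_tags("churn-train", 17, sample)

print(tags["dataset_version"])
print("content changed:", tags["dataset_sha256"] != changed["dataset_sha256"])
```

If the stored hash and a freshly computed one differ for the same version label, the version was mutated in place and past runs can no longer be reproduced from it.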

💼 Example (Real-world Use Case)

A claims-triage model regresses after release. Because the team versions data and models correctly, they trace the live model to training dataset version 18, which contained a changed label mapping from an upstream partner. They roll back from model version 42 to version 41 and quarantine dataset version 18 for investigation.
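The rollback decision in this example can be sketched as a small lookup over registry metadata. Everything here is an assumption made for illustration: the `RegistryEntry` shape, the `pick_rollback_target` helper, and the dataset versions attached to models 40 and 41 (the scenario only states that version 42 used dataset 18). A real registry would be queried through its own API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical subset of the metadata attached to a registered model."""
    model_version: int
    dataset_version: int
    approved: bool


def pick_rollback_target(
    entries: Iterable[RegistryEntry],
    live_version: int,
    quarantined_datasets: Set[int],
) -> Optional[RegistryEntry]:
    """Return the newest approved earlier version not trained on quarantined data."""
    candidates = [
        e for e in entries
        if e.model_version < live_version
        and e.approved
        and e.dataset_version not in quarantined_datasets
    ]
    return max(candidates, key=lambda e: e.model_version, default=None)


registry = [
    RegistryEntry(model_version=40, dataset_version=16, approved=True),  # assumed
    RegistryEntry(model_version=41, dataset_version=17, approved=True),  # assumed
    RegistryEntry(model_version=42, dataset_version=18, approved=True),  # from the scenario
]

target = pick_rollback_target(registry, live_version=42, quarantined_datasets={18})
print(target.model_version)  # 41
```

The lookup only works because each model version carries its dataset version. Without that link, the team would have to guess which earlier model is safe.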

🧪 Hands-on

  1. Create a lineage table with columns for model version, dataset version, Git commit, experiment run, and deployment date.
  2. Check how many of those columns your current models can actually fill.
  3. Tag one Git commit and one model artifact with the same release identifier.
  4. Write a rollback note for one model: if version 42 fails, what exact version do you restore?
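Steps 1 and 2 above can be tried with a few lines of standard-library Python. This is a sketch under stated assumptions: the column names mirror step 1, the first row reuses identifiers from this lesson, and the second row (`fraud-model:8`) is invented to show what an under-documented model looks like.

```python
import csv
import io
from typing import Dict, List

COLUMNS = [
    "model_version",
    "dataset_version",
    "git_commit",
    "experiment_run",
    "deployment_date",
]

rows: List[Dict[str, str]] = [
    {
        "model_version": "churn-model:42",
        "dataset_version": "churn-train:17",
        "git_commit": "8d4f2c1",
        "experiment_run": "284",
        "deployment_date": "",  # unknown: nobody recorded it
    },
    {
        "model_version": "fraud-model:8",  # invented example row
        "dataset_version": "",
        "git_commit": "",
        "experiment_run": "",
        "deployment_date": "",
    },
]


def coverage(table: List[Dict[str, str]]) -> Dict[str, float]:
    """Fraction of rows with a non-empty value, per column."""
    if not table:
        return {column: 0.0 for column in COLUMNS}
    return {
        column: sum(1 for row in table if row.get(column, "").strip()) / len(table)
        for column in COLUMNS
    }


def to_csv(table: List[Dict[str, str]]) -> str:
    """Serialize the lineage table so it can be committed or shared."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


print(to_csv(rows))
for column, filled in coverage(rows).items():
    print(f"{column}: {filled:.0%} filled")
```

Columns with low coverage show which links of the chain your process does not capture yet.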

🎮 Try It Yourself

🎮 Lineage Exercise

Invent a release named fraud-v8. Write its code commit, dataset version, validation report location, model registry ID, and endpoint name. Then decide which incident would become hard to debug if any one of those links were missing.

🐛 Debugging Scenario

Problem: business accuracy drops, but nobody can tell whether the cause is new data or a code change.
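One way to approach this problem, assuming lineage was recorded for both the last stable release and the regressed one, is to diff the two records and see which links changed. The `changed_links` helper and the `claims-*` identifiers are hypothetical and only illustrate the idea.

```python
from typing import Dict, Optional, Tuple


def changed_links(
    stable: Dict[str, str],
    regressed: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each lineage link that differs to its (stable, regressed) values."""
    keys = sorted(set(stable) | set(regressed))
    return {
        key: (stable.get(key), regressed.get(key))
        for key in keys
        if stable.get(key) != regressed.get(key)
    }


# Hypothetical lineage records for two consecutive releases.
stable_release = {
    "git_commit": "8d4f2c1",
    "dataset_version": "claims-train:17",
    "feature_pipeline_version": "claims_features_v5",
}
regressed_release = {
    "git_commit": "8d4f2c1",
    "dataset_version": "claims-train:18",
    "feature_pipeline_version": "claims_features_v5",
}

diff = changed_links(stable_release, regressed_release)
print(diff)  # {'dataset_version': ('claims-train:17', 'claims-train:18')}
```

If only the dataset version differs, the investigation starts with the data. If the commit differs as well, retraining with one link held constant separates the two causes.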

🎯 Interview Questions

Beginner

What is model versioning?

It gives each deployable model artifact an immutable identity and metadata.

Why version datasets too?

Because changing data can change model behavior just as much as changing code.

What is lineage?

Lineage is the trace from data and code to the live deployed model.

Can Git alone solve MLOps versioning?

No. You also need versioned data, experiment tracking, and registry-backed artifacts.

Why is rollback hard without versioning?

Because you may not know what the last stable model actually was.

Intermediate

What should be attached to a registry entry?

Metrics, dataset version, code version, environment version, ownership, and approval status.

What is the difference between experiment tracking and model versioning?

Experiment tracking records candidate runs. Model versioning records approved artifacts ready for promotion.

Why should dataset versions be immutable?

So the same version always means the same content for reproducibility and auditability.

What is the biggest anti-pattern in version naming?

Ad hoc human labels such as final-final-v2, because they are ambiguous and break traceability.

Why store deployment metadata too?

Because knowing what was trained is not enough; you also need to know what was actually deployed.

Scenario-based

A regulator asks which data produced the live underwriting model. What do you provide?

The deployed model version, dataset version, code commit, validation report, and approval trail.

A release failed, but the model file was replaced in place. What lesson should the team take?

Never mutate production artifacts in place; publish immutable versions and promote by reference.

A dataset owner republishes data under the same path. Why is that dangerous?

It breaks reproducibility because historical runs no longer reference stable content.

A retrained model performs worse, but no code changed. Where do you investigate first?

Investigate dataset versions, label quality, and feature freshness first.

Two teams train the same model name with different datasets. How do you prevent confusion?

Use explicit versioning, clear ownership metadata, and promotion policies in the registry.

🌐 Real-world Usage

Finance, healthcare, and insurance teams depend on lineage for audit and rollback. Consumer platforms rely on it to compare candidates and recover from regressions quickly.

📝 Summary

If you cannot trace a production model back to exact data and code, you do not really control it. Versioning is the backbone of reliable MLOps.