Beginner · Lesson 1 of 16

What is MLOps and the ML Lifecycle

Understand why machine learning models fail in production, what MLOps solves, and how the complete ML lifecycle maps to engineering workflows that deliver reliable AI systems.

🧒 Simple Explanation (ELI5)

Imagine a scientist discovers a cure for a disease in a lab. The discovery works perfectly in the lab, but getting it to patients requires factories, FDA approvals, cold-chain logistics, and quality checks. Without all that infrastructure, the cure never reaches anyone.

MLOps is the same idea for machine learning. A data scientist builds a model in a notebook and it works great — but getting it to real users, running reliably 24/7, and improving over time requires an entire engineering system. MLOps is that system.

🔧 Why MLOps Exists

Most projects stall before production: a widely quoted industry estimate is that 87% of ML projects never reach production, usually not because the models are bad but because there is no engineering infrastructure to deploy them safely.
Models break silently: Unlike software bugs that throw errors, a degrading ML model keeps returning predictions — just increasingly wrong ones. Nobody notices until business metrics drop.
Data and code both change: Software just needs code version control. ML needs version control for data, code, models, and hyperparameters simultaneously.
Reproducibility crisis: A model trained 3 months ago can be impossible to recreate if nobody tracked which data version, library version, or config was used.
Retraining complexity: Re-running a Jupyter notebook in production is not a deployment strategy. Automated, validated retraining pipelines are.

🌍 Real-world Analogy

MLOps is like the factory, supply chain, and quality control system behind a restaurant's kitchen. The chef (data scientist) creates recipes (models). But getting those dishes to thousands of customers reliably every day requires standardised recipes (reproducible training), food safety checks (model validation), a delivery system (deployment), health inspections (monitoring), and a feedback loop when customers complain (retraining).

DevOps solved the same problem for traditional software. MLOps solves it for machine learning.

⚙️ The ML Lifecycle — End to End

Stage 1: Data Management

Data collection from sources (databases, APIs, data lakes)
Data versioning with DVC or Azure ML data assets
Feature engineering and validation
Train/validation/test splits with controlled randomness (fixed seeds)
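
The versioning and fixed-seed points above can be sketched with the standard library alone. The helper names `fingerprint` and `seeded_split` are hypothetical, and the content hash only stands in for the kind of identifier a data versioning tool records; it is not DVC's or Azure ML's API.

```python
import hashlib
import json
import random

def fingerprint(rows):
    """Return a short content hash identifying this exact dataset snapshot."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def seeded_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy with a fixed seed so the split is identical on every run."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy rows, made up for illustration.
rows = [{"id": i, "x": i * 0.5, "label": i % 2} for i in range(10)]
train, test = seeded_split(rows)
train_again, test_again = seeded_split(rows)

print("data version:", fingerprint(rows))
print("same split on rerun:", test == test_again)
```

Logging the fingerprint next to each training run is one simple way to tell later whether "the same data" really was the same.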

Stage 2: Model Development

Experiment tracking — log hyperparameters, metrics, and artifacts to MLflow or Azure ML
Model training with tracked runs (never train without tracking)
Evaluation against baseline — model must beat the current champion
Validation: data quality, fairness, bias checks

Stage 3: Model Packaging and Registration

Package model with dependencies (MLflow model, ONNX, Docker)
Register model in a Model Registry with metadata (training dataset, metrics, author)
Promote through stages: Staging → Production
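
To make the registration and promotion steps concrete, here is a toy in-memory registry. It is an illustration only: `ToyRegistry`, `ModelVersion`, and the transition table are hypothetical names, not the API of MLflow or any other registry, and the metric values are made up.

```python
from dataclasses import dataclass

# Allowed stage transitions (an assumption for this sketch).
ALLOWED = {
    "None": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

@dataclass
class ModelVersion:
    name: str
    version: int
    metrics: dict
    dataset_version: str
    stage: str = "None"

class ToyRegistry:
    """In-memory stand-in; a real registry persists this metadata."""

    def __init__(self):
        self.versions = []

    def register(self, name, metrics, dataset_version):
        version = 1 + sum(1 for v in self.versions if v.name == name)
        mv = ModelVersion(name, version, metrics, dataset_version)
        self.versions.append(mv)
        return mv

    def promote(self, mv, target):
        if target not in ALLOWED[mv.stage]:
            raise ValueError(f"cannot move {mv.stage} -> {target}")
        if target == "Production":
            # Keep a single production version per model name.
            for other in self.versions:
                if other.name == mv.name and other.stage == "Production":
                    other.stage = "Archived"
        mv.stage = target

registry = ToyRegistry()
v1 = registry.register("iris-classifier", {"accuracy": 0.93}, "data-v1")
registry.promote(v1, "Staging")
registry.promote(v1, "Production")
v2 = registry.register("iris-classifier", {"accuracy": 0.95}, "data-v2")
registry.promote(v2, "Staging")
registry.promote(v2, "Production")
print(v1.version, v1.stage, "|", v2.version, v2.stage)
```

The point of the transition table is that a model cannot jump straight from training to Production without passing through Staging.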

Stage 4: Deployment

Deploy to real-time endpoint (REST API) or batch pipeline
Blue-green or canary deployment to control rollout risk
Automated smoke tests against the deployed endpoint
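
A smoke test can be as small as sending a few known-good inputs and checking that the answers are sane and fast. The sketch below takes any `predict` callable so it stays self-contained; in practice that callable would wrap an HTTP call to the endpoint. The function name `smoke_test`, the latency budget, and `fake_endpoint` are assumptions for illustration.

```python
import time

def smoke_test(predict, sample_inputs, allowed_labels, max_latency_s=0.1):
    """Run a few inputs through `predict` and return a list of failures."""
    failures = []
    for features in sample_inputs:
        start = time.perf_counter()
        try:
            label = predict(features)
        except Exception as exc:  # any crash counts as a failed smoke test
            failures.append(f"{features}: raised {exc!r}")
            continue
        elapsed = time.perf_counter() - start
        if label not in allowed_labels:
            failures.append(f"{features}: unexpected label {label!r}")
        if elapsed > max_latency_s:
            failures.append(f"{features}: took {elapsed:.3f}s")
    return failures

# Stand-in for the deployed endpoint (a hypothetical rule, not a real model).
def fake_endpoint(features):
    return 0 if features[2] < 2.5 else 1

samples = [[5.1, 3.5, 1.4, 0.2], [6.2, 2.9, 4.3, 1.3]]
failures = smoke_test(fake_endpoint, samples, allowed_labels={0, 1, 2})
print("smoke test passed" if not failures else failures)
```

A deployment pipeline would typically block promotion, or trigger a rollback, whenever the returned list is non-empty.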

Stage 5: Monitoring and Feedback

Monitor prediction distributions, latency, error rates
Detect data drift (input features changing) and concept drift (relationship between features and labels changing)
Collect ground truth labels when available to measure real accuracy
Trigger retraining when drift exceeds threshold
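
One common drift score is the Population Stability Index (PSI), which compares how a feature is spread across bins in the training data versus live data. The following is a minimal standard-library sketch, assuming equal-width bins taken from the reference sample; the 0.1 and 0.25 cut-offs mentioned in the comments are a widely used rule of thumb, not a fixed standard.

```python
import math
import random

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so live values outside the reference range hit edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
training = [rng.gauss(0, 1) for _ in range(5000)]
live_same = [rng.gauss(0, 1) for _ in range(5000)]
live_shifted = [rng.gauss(1, 1) for _ in range(5000)]

# Rule of thumb: below 0.1 is stable, above 0.25 is a significant shift.
print(f"no drift: {psi(training, live_same):.3f}")
print(f"shifted:  {psi(training, live_shifted):.3f}")
```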

📊 Visual: ML Lifecycle Loop

The ML Lifecycle: Train → Deploy → Monitor → Retrain

📊 Data (Collect + Version) → 🧪 Train (Experiment Tracking) → ✅ Validate (Beat Baseline) → 📦 Register (Model Registry) → 🚀 Deploy (Endpoint / Batch)

→ 🏭 Serve (Real-time / Batch) → 📈 Monitor (Predictions + Data) → ⚠️ Drift Alert (PSI / KS test) → 🔄 Retrain (Triggered Pipeline) → back to the start of the loop

📈 MLOps Maturity Model

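Google and Microsoft each publish an MLOps maturity model with numbered levels (Google describes Levels 0 to 2, Microsoft Levels 0 to 4). The table below is a simplified four-level view in the same spirit. Understanding where you are tells you what to build next.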

Level	Name	What it means	Typical team
Level 0	Manual	Notebooks, manual retraining, no pipeline automation	Early-stage startups, research teams
Level 1	ML Pipeline Automation	Automated training pipelines, experiment tracking, model registry	Most production ML teams
Level 2	CI/CD for ML	Automated model validation, CI/CD gates, staging environments	Platform engineering org
Level 3	Continuous Training	Drift-triggered retraining, fully automated lifecycle, feature stores	FAANG / large enterprise

⌨️ Your First MLOps Pipeline — Getting Started

python

"""
Minimal MLOps: track a model experiment with MLflow.
Install: pip install mlflow scikit-learn
"""
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# ── Load data ─────────────────────────────────────────────────────────────────
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ── Configure MLflow tracking ─────────────────────────────────────────────────
mlflow.set_tracking_uri("http://localhost:5000")   # or use Azure ML workspace URI
mlflow.set_experiment("iris-classification-v1")

with mlflow.start_run(run_name="random-forest-baseline"):
    # ── Hyperparameters ───────────────────────────────────────────────────────
    params = {"n_estimators": 100, "max_depth": 4, "random_state": 42}
    mlflow.log_params(params)

    # ── Train ─────────────────────────────────────────────────────────────────
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # ── Evaluate ──────────────────────────────────────────────────────────────
    preds = model.predict(X_test)
    acc   = accuracy_score(y_test, preds)
    f1    = f1_score(y_test, preds, average='weighted')
    mlflow.log_metrics({"accuracy": acc, "f1_weighted": f1})

    # ── Log model artifact ────────────────────────────────────────────────────
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="iris-classifier"   # register in Model Registry
    )
    print(f"Accuracy: {acc:.4f} | F1: {f1:.4f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

bash

# Start MLflow tracking server locally
pip install mlflow scikit-learn
mlflow server --host 127.0.0.1 --port 5000   # localhost only; 0.0.0.0 would expose the server to the network

# Open http://localhost:5000 to view experiments
# Run the Python script and watch metrics appear in the UI

🧪 Hands-on

Install MLflow: pip install mlflow scikit-learn and start the tracking server.
Run the experiment tracking script above. Open http://localhost:5000 and verify the run appears with accuracy and F1 metrics.
Change n_estimators to 50 and run again. Compare both runs in the MLflow UI — which performed better?
Click on the run in MLflow and open Artifacts to confirm the model files were logged. Then open the Models page and verify that iris-classifier was registered automatically in the Model Registry.
Identify where your current ML workflows are on the maturity model (Level 0–3). List 3 specific improvements that would move you to the next level.

🎮 Try It Yourself

Challenge: Map Your ML System to the Lifecycle

Input: Take any ML project you know (real or hypothetical). Draw the 5 lifecycle stages (Data → Train → Register → Deploy → Monitor) and fill in what happens at each stage today.
Gap analysis: Which stages are fully automated? Which are manual? For each manual stage, write one sentence on what automation would look like.
Maturity score: Using the table above, assign a maturity level (0–3) to the project. What is the single highest-impact change to move to the next level?
Production failure exercise: Imagine the model was deployed 6 months ago and never updated. List 3 ways it might silently be giving wrong predictions today. For each, name the monitoring signal that would detect it.

🧠 Debugging Scenario

Problem: Model was trained and works in the notebook but nobody can reproduce the same accuracy two weeks later

Signal: A colleague tries to retrain the model using "the same code" and gets accuracy of 0.74 vs original 0.91.
Root cause 1 (no data versioning): The training dataset was modified after the original run. No DVC version or Azure ML data asset snapshot was created, so the exact training data is lost.
Root cause 2 (no environment pinning): scikit-learn was 1.2 when the model was trained; the new environment has 1.4, which changed a hyperparameter default.
Root cause 3 (no run tracking): The original hyperparameters were not logged to MLflow. The colleague used different settings.
Fix: Immediately start logging every run with MLflow (params + metrics + environment). Use mlflow.log_artifact("requirements.txt") to capture library versions. Adopt DVC for dataset versioning. Create a requirements.txt or conda.yml as part of model artifacts.
Prevention: Make experiment tracking non-optional in your team: no training run ships unless it has a tracked MLflow experiment with data version, hyperparameters, and metrics logged.
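
Root cause 2 is cheap to guard against: record the installed versions at training time and diff them before retraining. This is a minimal sketch using only the standard library; `snapshot_environment` and `diff_environments` are hypothetical helper names, and the two hand-written snapshots simply reuse the versions from the scenario above.

```python
import json
import platform
import sys
from importlib import metadata

def snapshot_environment(packages):
    """Record the interpreter and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }

def diff_environments(old, new):
    """Return packages whose recorded versions differ between snapshots."""
    names = set(old["packages"]) | set(new["packages"])
    return {
        n: (old["packages"].get(n), new["packages"].get(n))
        for n in sorted(names)
        if old["packages"].get(n) != new["packages"].get(n)
    }

current = snapshot_environment(["scikit-learn", "mlflow", "numpy"])
print(json.dumps(current, indent=2))

# Snapshots mirroring the debugging scenario, for illustration only.
original = {"packages": {"scikit-learn": "1.2"}}
retrain = {"packages": {"scikit-learn": "1.4"}}
print(diff_environments(original, retrain))
```

Saving the snapshot as JSON and logging it as a run artifact keeps it attached to the model it describes.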

🎯 Interview Questions

Beginner

What is MLOps and how does it differ from DevOps?

DevOps automates the software development lifecycle (build, test, deploy). MLOps extends this to include the ML-specific lifecycle: data versioning, experiment tracking, model training, validation, registry, deployment, monitoring, and retraining. The key difference is that MLOps must manage data and model artifacts alongside code, and must handle model degradation over time — something that doesn't happen with traditional software.

What are the five main stages of the ML lifecycle?

1) Data management — collect, version, and prepare training data. 2) Model development — experiment tracking, training, evaluation. 3) Model packaging and registration — package the model with dependencies and register it with metadata. 4) Deployment — serve the model as a real-time endpoint or batch pipeline. 5) Monitoring and retraining — track prediction quality, detect drift, and retrain automatically when performance degrades.

Why do most ML projects fail to reach production?

The most common reasons are: no reproducibility (can't recreate the model), no deployment infrastructure (the model lives only in a notebook), no monitoring (nobody knows when it starts failing), and misalignment between offline metrics and real-world performance. Research environments optimise for flexibility and speed; production requires reliability, reproducibility, and maintainability — MLOps bridges this gap.

What is experiment tracking and why is it important?

Experiment tracking records every training run's parameters, metrics, environment, and artifacts. It answers: "which model version works best and how was it built?" Without tracking, you can't compare runs, reproduce results, or understand what changed between two model versions. Tools like MLflow, Azure ML, and W&B provide experiment tracking dashboards for teams.

What is a Model Registry?

A Model Registry is a centralised catalog that stores model versions with their metadata: training run ID, metrics, data version, author, and deployment status. It manages model lifecycle stages (Development → Staging → Production → Archived). It ensures there's one source of truth for "what model is running in production right now" and enables safe promotion and rollback of model versions.

Intermediate

What is the difference between model drift and data drift?

Model drift is the umbrella term for a deployed model's predictive performance decaying over time; data drift is one of its two main causes, concept drift the other. Data drift (also called input drift or feature drift) means the distribution of the input features at inference time has shifted away from what the model was trained on. Example: the model was trained on customers aged 25-40, but now most users are 50-65. Concept drift means the relationship between features and the target label has changed: the world changed, not just the data. Example: fraud patterns changed after a new payment system launched. Data drift can often be detected from the inputs alone with statistical tests; concept drift generally requires ground truth labels to detect.
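
A small simulation can make the distinction tangible. In this sketch, with made-up distributions and a hypothetical one-line `model`, data drift shows up in the input statistics, while concept drift leaves the inputs unchanged and only becomes visible once true labels are compared with predictions.

```python
import random
from statistics import mean

def model(x):
    """Frozen 'production' model: learned the rule label = 1 when x > 0.5."""
    return int(x > 0.5)

def accuracy(xs, labels):
    return mean(int(model(x) == y) for x, y in zip(xs, labels))

rng = random.Random(7)
n = 10_000

# Training-time world: x uniform on [0, 1], true rule is x > 0.5.
train_x = [rng.uniform(0, 1) for _ in range(n)]

# Data drift: inputs shift upward, but the true rule is unchanged.
drift_x = [rng.uniform(0.3, 1.3) for _ in range(n)]
drift_y = [int(x > 0.5) for x in drift_x]

# Concept drift: inputs look the same, but the true rule moved to x > 0.7.
concept_x = [rng.uniform(0, 1) for _ in range(n)]
concept_y = [int(x > 0.7) for x in concept_x]

print("input means (train, data drift, concept drift):",
      round(mean(train_x), 2), round(mean(drift_x), 2), round(mean(concept_x), 2))
print("accuracy under data drift:   ", accuracy(drift_x, drift_y))
print("accuracy under concept drift:", round(accuracy(concept_x, concept_y), 2))
```

In this toy case the shifted inputs happen not to hurt accuracy because the model's rule still matches reality; in real systems data drift often does hurt, since it pushes inputs into regions the model saw rarely or never.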

How do you make ML experiments reproducible?

Five requirements: 1) Fix random seeds everywhere (Python's random, NumPy, PyTorch, and scikit-learn's random_state). 2) Version the training dataset: tag or hash the exact data snapshot. 3) Log all hyperparameters to the experiment tracker. 4) Pin the Python environment in requirements.txt or conda.yml, and log it as a run artifact. 5) Version control all training code (never train from uncommitted changes). With these five in place, a run can be rebuilt from its ID months later; bit-for-bit identical results may additionally require deterministic settings on GPUs and in multithreaded libraries.

What is the champion-challenger model deployment pattern?

The champion model is the current production model. A challenger is a newly trained model candidate. In champion-challenger deployment, both models receive traffic simultaneously (usually a small percentage to the challenger, like 5-10%). Their predictions and outcomes are compared against each other. If the challenger consistently outperforms the champion over a defined period, it is promoted to champion and the old model archived. This de-risks model updates and enables A/B testing of model versions in production.
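
The traffic split itself is often done by hashing a stable request or user ID, so the same caller always reaches the same model. A minimal sketch, assuming a 10% challenger share and a hypothetical `assign_model` helper:

```python
import hashlib
from collections import Counter

def assign_model(request_id, challenger_share=0.10):
    """Deterministically route a request: same ID, same model every time."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"

# Simulated user IDs, for illustration only.
counts = Counter(assign_model(f"user-{i}") for i in range(20_000))
print(counts)
```

Hashing rather than calling a random number generator per request keeps each user's experience consistent and makes the assignment reproducible when outcomes are analysed later.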

What are model validation gates in an MLOps CI/CD pipeline?

Validation gates are automated checks a model must pass before it can advance to the next pipeline stage. Common gates: 1) Accuracy/F1 must exceed a minimum threshold. 2) Model must outperform the current production champion by at least 1%. 3) Prediction latency must be under 100ms. 4) Model must pass bias/fairness checks across protected attributes. 5) Data validation — no nulls, expected feature distributions. If any gate fails, the pipeline stops and engineers are notified.
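
In code, a gate is just a set of comparisons that returns which checks failed. The sketch below covers the first four gates; the metric names, the reading of "1%" as an absolute gain of 0.01, and all numbers are assumptions for illustration.

```python
def run_gates(candidate, champion, limits):
    """Return the names of failed gates; an empty list means 'promote'."""
    failed = []
    if candidate["f1"] < limits["min_f1"]:
        failed.append("min_f1")
    if candidate["f1"] < champion["f1"] + limits["min_gain"]:
        failed.append("beats_champion")
    if candidate["p95_latency_ms"] > limits["max_latency_ms"]:
        failed.append("latency")
    # max_group_gap: largest metric gap between protected groups (hypothetical).
    if candidate["max_group_gap"] > limits["max_group_gap"]:
        failed.append("fairness")
    return failed

limits = {"min_f1": 0.85, "min_gain": 0.01,
          "max_latency_ms": 100, "max_group_gap": 0.05}
champion = {"f1": 0.90}

good = {"f1": 0.92, "p95_latency_ms": 40, "max_group_gap": 0.02}
slow = {"f1": 0.92, "p95_latency_ms": 180, "max_group_gap": 0.02}
weak = {"f1": 0.80, "p95_latency_ms": 40, "max_group_gap": 0.02}

for name, cand in [("good", good), ("slow", slow), ("weak", weak)]:
    print(name, run_gates(cand, champion, limits))
```

Returning the failed gate names, rather than a bare pass/fail, gives the notification sent to engineers something actionable.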

How does MLOps change when you move from batch to real-time inference?

Batch inference (scores data in bulk, scheduled) has relaxed latency requirements — models can be large and complex. Real-time inference (REST API, <100ms SLA) requires: model optimisation (quantisation, ONNX conversion), efficient serving (TorchServe, Triton, Azure ML Managed Online Endpoint), auto-scaling, and low-latency feature serving from an online feature store. Monitoring also differs: batch monitors output distributions daily; real-time requires per-request latency/error monitoring and canary deployment to control rollout risk.

Scenario-based

A business-critical model was deployed 8 months ago and shows 15% lower accuracy than at launch. How do you investigate and fix it?

First, diagnose: check if input feature distributions have drifted from training data using a statistical drift test (KS test or PSI). Gather actual ground truth labels if available and compute real accuracy. Review if the business context changed (new products, seasonal patterns, regulation changes). Then fix: if data drift is the cause, retrain on recent data. If concept drift, consider feature engineering updates or a different model architecture. Validate the new model outperforms the current champion before promoting. Add drift monitoring so this is detected automatically in future.
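
For the diagnosis step, the two-sample KS statistic is simply the largest gap between the empirical distributions of a feature in training data and in recent traffic. Here is a standard-library sketch on simulated data; the 1.36 factor is the usual large-sample approximation for a 5% significance level, and the feature values are made up.

```python
import bisect
import math
import random

def ks_statistic(reference, live):
    """Largest gap between the two empirical CDFs (0 identical, 1 disjoint)."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for v in ref + cur:
        cdf_ref = bisect.bisect_right(ref, v) / len(ref)
        cdf_cur = bisect.bisect_right(cur, v) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

def ks_critical(n, m, c_alpha=1.36):
    """Approximate 5% critical value for samples of size n and m."""
    return c_alpha * math.sqrt((n + m) / (n * m))

rng = random.Random(1)
training = [rng.gauss(50, 10) for _ in range(2000)]
recent_ok = [rng.gauss(50, 10) for _ in range(2000)]
recent_shifted = [rng.gauss(58, 10) for _ in range(2000)]

threshold = ks_critical(len(training), len(recent_shifted))
print(f"critical value: {threshold:.3f}")
print(f"unchanged feature: {ks_statistic(training, recent_ok):.3f}")
print(f"shifted feature:   {ks_statistic(training, recent_shifted):.3f}")
```

With many features and large samples, tiny shifts become statistically significant, so teams often alert on the size of the statistic as well as on the test result.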

Your team deploys a new model version and within 30 minutes the API returns 503 errors. What is your incident response?

Immediate: trigger the blue-green rollback to restore the previous stable model to 100% traffic. Because the previous deployment is still running, this typically restores service within minutes. Then investigate: check deployment logs for the new model container (image build failures? memory limit exceeded? dependency error?). Check if the new model is loading correctly in the endpoint. Reproduce the error in staging. Fix root cause. Write a pre-deployment smoke test that would have caught this error before promotion to production. Redeploy only after the smoke test passes.

How would you convince a data science team that currently runs Jupyter notebooks to adopt MLOps practices?

Start with a concrete business case: show them the cost of the last time someone couldn't reproduce a model or had to manually retrain and redeploy. Quantify it in hours. Then introduce incrementally: first add MLflow experiment tracking (zero friction, just 3 lines of code). After 2 weeks, show how it makes comparing models trivial. Then add a model registry. Then automate retraining. Never introduce all practices at once — the overhead feels too high. Each increment should solve a pain they already feel. The goal is to be the team that ships reliable models fast, not to be bureaucratic.

A model trained last year works great on historical data but fails in production today. How do you prevent this in future deployments?

This is model staleness: the data or the underlying relationships have drifted since training. (It is often confused with training-serving skew, which is a mismatch between how features are computed at training time and at serving time.) Prevention strategy: 1) Deploy model monitoring on day 1; never deploy without monitoring. 2) Compare inference feature distributions with training distributions weekly using PSI or KS tests. 3) Set up ground truth collection: even a 1% sample of predictions compared to actual outcomes gives an early drift signal. 4) Define a model SLO: "if accuracy drops below X% or drift score exceeds Y, automatically retrain." 5) Schedule proactive retraining quarterly even without drift, because models should be refreshed regularly. The goal is treating model staleness as a known operational risk, not a surprise.
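
The SLO in point 4 and the scheduled refresh in point 5 reduce to a small decision function. In this sketch `ModelSLO` and `retrain_reasons` are hypothetical names, and the thresholds (0.85 accuracy, 0.25 drift score, 90 days) are placeholder values for X, Y, and "quarterly".

```python
from dataclasses import dataclass

@dataclass
class ModelSLO:
    min_accuracy: float     # "X" in the SLO
    max_drift_score: float  # "Y" in the SLO
    max_age_days: int       # scheduled refresh even without drift

def retrain_reasons(slo, accuracy, drift_score, age_days):
    """Return why a retrain should be triggered; empty means leave it."""
    reasons = []
    # accuracy is None while ground truth labels have not arrived yet.
    if accuracy is not None and accuracy < slo.min_accuracy:
        reasons.append("accuracy below SLO")
    if drift_score > slo.max_drift_score:
        reasons.append("drift above SLO")
    if age_days > slo.max_age_days:
        reasons.append("scheduled refresh due")
    return reasons

slo = ModelSLO(min_accuracy=0.85, max_drift_score=0.25, max_age_days=90)
print(retrain_reasons(slo, accuracy=0.91, drift_score=0.05, age_days=30))
print(retrain_reasons(slo, accuracy=None, drift_score=0.40, age_days=30))
print(retrain_reasons(slo, accuracy=0.80, drift_score=0.05, age_days=120))
```

Allowing `accuracy=None` matters in practice: drift scores are available immediately, while ground truth often arrives days or weeks later.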

Your company processes 10 million predictions per day. How does this scale consideration change your MLOps architecture?

At this scale: 1) Model serving must use a dedicated inference server (Azure ML Managed Online Endpoints with auto-scaling, or Triton Inference Server) behind a load balancer. 2) Feature serving requires an online feature store with low-millisecond read latency. 3) Monitoring can't log every prediction: sample 1% or compute rolling statistics. 4) Retraining on data from 10M daily predictions is likely to need distributed compute (Azure ML cluster, Spark). 5) A/B testing and canary deployment must be statistically sound at high volume; even 1% of traffic is 100K daily predictions to compare. 6) Model artifacts need versioned storage (Azure Blob, S3) with TTL policies to avoid storage explosion.
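
"Compute rolling statistics" in point 3 can be done without storing any predictions, using Welford's online algorithm. The sketch below is a plain-Python version with a hypothetical `RunningStats` class and made-up scores; a production system would more likely keep such counters per time window.

```python
import math
import statistics

class RunningStats:
    """Welford's online algorithm: mean and std without storing values."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0

# Made-up prediction scores, for illustration only.
scores = [0.12, 0.40, 0.43, 0.91, 0.77, 0.05]
stats = RunningStats()
for s in scores:
    stats.update(s)

print(f"count={stats.n} mean={stats.mean:.4f} std={stats.std:.4f}")
print("matches batch computation:",
      math.isclose(stats.std, statistics.stdev(scores)))
```

Each update is constant time and constant memory, which is what makes this practical at millions of predictions per day.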

🌐 Real-world Usage

Netflix reportedly uses MLOps to manage 1,000+ ML models in production for recommendation, content scoring, and streaming quality. Microsoft Azure ML implements the full MLOps lifecycle with built-in experiment tracking, model registry, and managed endpoints. Airbnb's ML Platform (Bighead) was an early enterprise MLOps platform; it standardised the entire lifecycle from feature engineering to model serving across hundreds of models. Spotify uses MLOps to manage playlist recommendation and podcast discovery models, with automated retraining triggered by changes in listener behaviour.

📝 Summary

MLOps is the engineering discipline that takes machine learning from notebook experiments to reliable production systems. The ML lifecycle has five stages: data, training, packaging, deployment, and monitoring. MLOps maturity ranges from Level 0 (manual notebooks) to Level 3 (fully automated, drift-triggered retraining). Start by adding experiment tracking: it costs almost nothing and is the first step toward reproducibility.
