Beginner · Lesson 2 of 16

ML Model Training, Validation, and Evaluation

Learn the engineering process behind model quality: how to split data correctly, evaluate models with the right metrics, track experiments, and define automated validation gates that block bad models from reaching production.

🧒 Simple Explanation (ELI5)

Think of training a model like teaching a student by showing them thousands of examples. Then you give the student practice problems they haven't seen — that's validation. Finally, there's the final exam with completely new questions — that's testing. A student who only memorises the textbook (overfitting) fails the exam. A student who learns principles generalises to new questions and passes.

In ML, we split data so the model can never "peek" at the exam during studying. If it peeks (data leakage), the exam score is meaningless.

🌍 Real-world Analogy

A new drug goes through preclinical testing (training data), Phase 1-2 clinical trials (validation), and Phase 3 double-blind trials (test set). The trial design prevents cheating: researchers don't know which patients got the drug (blinding). In ML, the test set is the final trial — used only once, never during development. Breaking this rule is the equivalent of rigging a clinical trial and claiming the drug works.

⚙️ Data Splitting: The Non-Negotiable Rules

Standard 80/10/10 Split
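
A minimal sketch of an 80/10/10 split, assuming scikit-learn's `train_test_split` and synthetic stand-in data (the array shapes and seed are illustrative, not from the lesson). It holds out the test set first, then carves the validation set out of what remains.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows, 5 features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Absolute hold-out size: 10% of all rows for each of val and test
n_holdout = len(X) // 10

# Step 1: set aside the test set and do not look at it again
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_holdout, random_state=42
)
# Step 2: split the remaining 90% into train (80% overall) and val (10% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, random_state=42
)

print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
```

Passing an integer `test_size` requests an absolute row count, which avoids rounding surprises when chaining two fractional splits.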

For Time-Series Data — NEVER Split Randomly

If your data has a time dimension (financial data, user events, telemetry), always split chronologically. Train on the past, validate on the near future, test on the most recent data. Random splits cause data leakage — the model "knows" future information during training and appears impossibly accurate.
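
One way to sketch a chronological split, assuming each row carries a timestamp. The helper name `chronological_split` and the synthetic daily event log are hypothetical; the point is that rows are ordered by time before any boundary is drawn.

```python
import numpy as np

# Hypothetical event log: 100 daily timestamps, stored in shuffled order
rng = np.random.default_rng(0)
timestamps = np.arange("2024-01-01", "2024-04-10", dtype="datetime64[D]")
rng.shuffle(timestamps)

def chronological_split(ts, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) index arrays, oldest rows first."""
    order = np.argsort(ts)                      # sort row indices by time
    n = len(ts)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return order[:train_end], order[train_end:val_end], order[val_end:]

train_idx, val_idx, test_idx = chronological_split(timestamps)

print("train ends:  ", timestamps[train_idx].max())
print("val starts:  ", timestamps[val_idx].min())
print("test starts: ", timestamps[test_idx].min())
```

Every training row is older than every validation row, and every validation row is older than every test row. For cross-validation on time-ordered data, scikit-learn's `TimeSeriesSplit` applies the same idea across several folds.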

Stratified Splitting for Imbalanced Classes

If 95% of examples are class 0 and 5% are class 1 (fraud detection, medical diagnosis), use stratified splitting to ensure all splits maintain the same class ratio. Without this, the validation set might have very few minority samples and give misleading evaluation.
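
A small illustration of what `stratify` does, assuming synthetic labels with the 95/5 ratio described above (the placeholder feature column is not meaningful).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 950 negatives, 50 positives (5% minority class)
y = np.array([0] * 950 + [1] * 50)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

# stratify=y keeps the positive rate (about 5%) the same in both subsets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"positive rate  train: {y_train.mean():.3f}  val: {y_val.mean():.3f}")
```

Dropping `stratify=y` leaves the minority count in the validation set to chance, which matters most when there are only a few dozen positives in total.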

📊 Visual: Training Pipeline with Tracked Experiments

Input → Train → Validate → Gate → Register
  1. 📊 Versioned Dataset: DVC / Azure Data Asset
  2. 🧪 Model Training: MLflow run tracking
  3. 📈 Metric Evaluation: Acc / F1 / AUC / RMSE
  4. ✅ Validation Gate: Must beat baseline
  5. 📦 Register Model: MLflow Registry

📏 Choosing the Right Evaluation Metric

| Problem Type | Primary Metric | When to use |
|---|---|---|
| Classification (balanced) | Accuracy, F1 | Equal class distribution, general classification |
| Classification (imbalanced) | Precision, Recall, AUC-ROC, PR-AUC | Fraud detection, medical diagnosis, rare events |
| Regression | RMSE, MAE, R² | Predicting prices, demand forecasting |
| Ranking | NDCG, MAP, MRR | Search, recommendations |
| NLP / LLM | BLEU, ROUGE, perplexity, BERT score | Translation, summarization |

⚠️ Accuracy is Misleading for Imbalanced Data

A model that predicts "no fraud" for every transaction achieves 99.9% accuracy on a dataset where 0.1% are fraud. The accuracy metric is useless here. For fraud detection, optimise for Recall (catching real fraud) or F1 (balance of precision and recall). Always check class distribution before choosing your metric.
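
The effect is easy to reproduce. The sketch below uses synthetic labels (10 fraud cases in 10,000 transactions) and a "model" that always predicts the majority class; nothing here comes from a real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels: 10 fraud cases in 10,000 transactions (0.1%)
y_true = np.zeros(10_000, dtype=int)
y_true[:10] = 1

# A "model" that predicts "no fraud" for every transaction
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)  # precision is 0/0, reported as 0

print(f"accuracy={acc:.3f}  recall={rec:.3f}  f1={f1:.3f}")
```

Accuracy comes out at 0.999 while recall and F1 are both 0: the model never catches a single fraud case.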

⌨️ Production-Grade Training Script with MLflow

python
"""
Production ML training script with:
- MLflow experiment tracking
- Proper stratified split
- Validation gate (model must beat baseline)
- Model registration only on pass
"""
import mlflow, mlflow.sklearn
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
import json, os

# ── Configuration ─────────────────────────────────────────────────────────────
MIN_F1_THRESHOLD    = 0.80   # validation gate: val F1 must be at least this
BASELINE_F1         = 0.75   # current champion's score
EXPERIMENT_NAME     = "churn-prediction-v2"
MODEL_NAME          = "churn-classifier"

# ── Data (replace with real data pipeline) ───────────────────────────────────
X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                           n_classes=2, weights=[0.85, 0.15], random_state=42)
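# Stratified split: preserves the class ratio in every subset.
# 20% is held out for test, then 12.5% of the remaining 80% (= 10% overall)
# becomes validation, giving a 70/10/20 train/val/test split.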
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.125, stratify=y_train, random_state=42
)

print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
print(f"Class balance (train): {np.bincount(y_train)}")

# ── MLflow tracking ───────────────────────────────────────────────────────────
mlflow.set_experiment(EXPERIMENT_NAME)

with mlflow.start_run(run_name="gradient-boosting-v1") as run:
    # ── Hyperparameters ───────────────────────────────────────────────────────
    params = {
        "n_estimators": 200, "max_depth": 4,
        "learning_rate": 0.05, "subsample": 0.8, "random_state": 42
    }
    mlflow.log_params(params)

    # ── Train ─────────────────────────────────────────────────────────────────
    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # ── Evaluate on validation set ────────────────────────────────────────────
    val_preds  = model.predict(X_val)
    val_proba  = model.predict_proba(X_val)[:, 1]
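    # Note: average='weighted' is dominated by the majority class. To gate on the
    # minority (positive) class alone, drop the argument to use average='binary'.
    val_f1     = f1_score(y_val, val_preds, average='weighted')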
    val_auc    = roc_auc_score(y_val, val_proba)
    mlflow.log_metrics({"val_f1": val_f1, "val_auc": val_auc})

    # ── Validation gate ───────────────────────────────────────────────────────
    passed = val_f1 >= MIN_F1_THRESHOLD and val_f1 > BASELINE_F1
    mlflow.log_param("validation_gate_passed", passed)
    print(f"\nVal F1: {val_f1:.4f} | Val AUC: {val_auc:.4f}")
    print(f"Baseline F1: {BASELINE_F1} | Gate: {'PASSED ✅' if passed else 'FAILED ❌'}")

    if passed:
        # ── Evaluate on test set (only once, after gate passes) ───────────────
        test_preds = model.predict(X_test)
        test_f1    = f1_score(y_test, test_preds, average='weighted')
        mlflow.log_metric("test_f1", test_f1)

        # ── Register model ────────────────────────────────────────────────────
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name=MODEL_NAME,
            input_example=X_test[:5]
        )
        print(f"\nModel registered: {MODEL_NAME}")
        print(f"Test F1: {test_f1:.4f}")
        print(classification_report(y_test, test_preds))
    else:
        print("\nModel did not pass validation gate. NOT registered.")
        print(f"Required F1 >= {MIN_F1_THRESHOLD} and > {BASELINE_F1}")

bash
# Run the training script and watch experiments in MLflow
python train.py

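# List registered versions of the model (the mlflow CLI has no "models list"
# subcommand, so this goes through the Python client)
python -c "
from mlflow.tracking import MlflowClient
for mv in MlflowClient().search_model_versions(\"name='churn-classifier'\"):
    print(mv.name, mv.version)
"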

# Load the latest registered model version for inference
# (train.py registers a version but never assigns it a stage such as Production)
python -c "
import mlflow
model = mlflow.sklearn.load_model('models:/churn-classifier/latest')
print(model.predict([[0.1]*20]))
"

🧪 Hands-on

  1. Run the training script above. Verify the model appears in MLflow with val_f1 and val_auc metrics.
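  2. Note the val_f1 from step 1. Set MIN_F1_THRESHOLD below it (for example 0.60) and re-run; confirm the model is registered. Then set it above val_f1 (for example 0.99) and re-run; confirm the run fails the gate and the model is NOT registered. Remember that the gate also requires val_f1 > BASELINE_F1.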
  3. Replace GradientBoostingClassifier with LogisticRegression. Compare both runs in the MLflow UI. Which algorithm achieves higher AUC?
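  4. Introduce data leakage deliberately: overwrite a feature with the target in every split (e.g., X_train[:, 0] = y_train, X_val[:, 0] = y_val, X_test[:, 0] = y_test). Notice how val_f1 and val_auc jump to nearly 1.0. A score this good is a warning sign: the model is reading the answer from a feature that would not exist at prediction time.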
  5. Add cross-validation: replace the single train/val split with 5-fold StratifiedKFold and log mean and std of fold F1 scores to MLflow. A high std means the model is unstable and shouldn't be promoted.
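
For step 5, one possible shape of the cross-validation loop is sketched below. It assumes the same kind of synthetic data as the training script and uses LogisticRegression to keep it fast; the variable names `cv_f1_mean` and `cv_f1_std` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           weights=[0.85, 0.15], random_state=42)

# Each fold keeps the 85/15 class ratio; shuffle before splitting
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_f1 = []
for train_idx, val_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)      # fresh model per fold
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    fold_f1.append(f1_score(y[val_idx], preds, zero_division=0))

cv_f1_mean = float(np.mean(fold_f1))
cv_f1_std = float(np.std(fold_f1))
print(f"F1 per fold: {np.round(fold_f1, 3)}")
print(f"mean={cv_f1_mean:.3f}  std={cv_f1_std:.3f}")
```

Inside an active MLflow run, the two summary values can be logged with the same call the training script already uses: `mlflow.log_metrics({"cv_f1_mean": cv_f1_mean, "cv_f1_std": cv_f1_std})`.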

🎮 Try It Yourself

🎮 Challenge: Design a Validation Gate for a Real Use Case
  1. Choose a use case: fraud detection, customer churn, or equipment failure prediction. Write down the right evaluation metric and explain why accuracy alone would be misleading.
  2. Set validation thresholds: Define 3 gates the model must pass before being registered. Include at least one metric gate, one fairness gate (no >10% performance disparity across demographic groups), and one latency gate (<50ms per prediction).
  3. Test the baseline: Implement a naive baseline (always predicts the majority class). Compute its F1 and AUC. This is the minimum bar any ML model must beat — if it can't beat a naive baseline, it has no business value.
  4. Simulate overfitting: Train a decision tree with no depth limit. Compare train accuracy vs test accuracy. If train accuracy is 99% and test is 72%, the model is overfit. Observe what cross-validation tells you vs a single train/test split.
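
For the baseline in step 3, a minimal sketch using scikit-learn's `DummyClassifier` on synthetic data (the dataset parameters are assumptions chosen to mirror the training script):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=15,
                           weights=[0.85, 0.15], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Naive baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
base_preds = baseline.predict(X_test)
base_proba = baseline.predict_proba(X_test)[:, 1]   # constant score per row

base_f1 = f1_score(y_test, base_preds, zero_division=0)
base_auc = roc_auc_score(y_test, base_proba)
print(f"baseline F1={base_f1:.3f}  AUC={base_auc:.3f}")
```

The baseline never predicts the positive class, so its F1 on that class is 0, and its constant scores give an AUC of 0.5. Any candidate model should clear both numbers comfortably.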

🧠 Debugging Scenarios

Problem: Model retraining always passes the gate even though it's getting worse in production

Problem: Your model achieves F1 of 0.98 on the test set but business stakeholders report it's wrong most of the time

🎯 Interview Questions

Beginner

What is the difference between training, validation, and test sets?

Training: used to fit model parameters (the model learns from this). Validation: used to tune hyperparameters and compare model variants during development — the model does NOT train on this. Test: used only once at the end, to give an honest estimate of real-world performance. The test set must never be used during development or hyperparameter tuning. Using it multiple times is a form of data leakage that inflates your estimate of model quality.

What is overfitting and how do you detect it?

Overfitting is when a model memorises the training data instead of learning general patterns. It performs extremely well on training data but poorly on new data. Detection: a large gap between training metrics and validation/test metrics (e.g., train accuracy 99%, val accuracy 72%) indicates overfitting. Cross-validation makes this signal more reliable. Fixes include: regularisation (L1/L2), reducing model complexity, more training data, dropout, or early stopping.
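
The train/validation gap can be seen with a decision tree that has no depth limit. This is a sketch on synthetic, deliberately noisy data (`flip_y=0.1` randomises a share of the labels); the helper `train_val_accuracy` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def train_val_accuracy(max_depth):
    """Fit a tree and return (train accuracy, validation accuracy)."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    return tree.score(X_train, y_train), tree.score(X_val, y_val)

deep_train, deep_val = train_val_accuracy(None)     # no depth limit
shallow_train, shallow_val = train_val_accuracy(3)  # constrained tree

print(f"unlimited depth: train={deep_train:.3f}  val={deep_val:.3f}")
print(f"max_depth=3:     train={shallow_train:.3f}  val={shallow_val:.3f}")
```

The unlimited tree memorises the training rows (train accuracy 1.0) and scores noticeably lower on validation, while the constrained tree typically shows a much smaller gap.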

When would you use F1 score instead of accuracy?

When classes are imbalanced. In fraud detection (0.1% fraud), a model that always predicts "no fraud" achieves 99.9% accuracy yet catches no fraud at all. F1 is the harmonic mean of precision (the share of flagged items that are real positives) and recall (the share of real positives you caught). For fraud or medical diagnosis, optimise for recall (don't miss real positives) while keeping precision acceptable. Under severe imbalance, PR-AUC (area under the precision-recall curve) is often more informative than F1 because it summarises performance across all thresholds rather than at a single one.

What is cross-validation?

Cross-validation splits the training data into K folds. In each iteration, K-1 folds are used for training and 1 fold for validation. This repeats K times, rotating the validation fold. The final metric is the mean across all K folds, with standard deviation showing variance. Benefits: uses all data for both training and validation, produces a more reliable metric estimate, and exposes model instability (high std across folds). Standard K values are 5 and 10.

What is data leakage and why is it dangerous?

Data leakage is when information from the test/validation set, or information from the future, is accidentally available to the model during training. It creates artificially inflated metrics that don't reflect real-world performance. Common types: 1) Target leakage — a feature directly encodes the target. 2) Train-test contamination — normalization scaler fitted on all data including test. 3) Temporal leakage — future data used for historical records in time-series. Business risk: a model shows 94% accuracy in validation but 62% in production, wasting engineering time and business trust.
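
Train-test contamination (type 2) is usually avoided by putting preprocessing inside a pipeline, so it is fitted on training rows only. A minimal sketch, assuming scikit-learn and synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Leaky (avoid): StandardScaler().fit(X) would compute statistics from test rows.
# Safe: the pipeline fits the scaler on X_train only, then reuses those
# statistics unchanged when transforming X_test.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
test_acc = pipe.score(X_test, y_test)

scaler = pipe.named_steps["standardscaler"]
print("scaler fitted on train only:",
      np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(f"test accuracy: {test_acc:.3f}")
```

The same pipeline object can be passed to cross-validation utilities, which refit the scaler inside each fold.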

Intermediate

How do you handle class imbalance in training?

Multiple strategies depending on severity: 1) Class weights — pass class_weight='balanced' to sklearn classifiers to upweight minority class in the loss. 2) Resampling — oversample minority (SMOTE) or undersample majority. 3) Threshold tuning — default classification threshold is 0.5, but moving it lower increases recall on the minority class. 4) Use appropriate metrics: PR-AUC, F1, or ROC-AUC, not accuracy. 5) Collect more minority-class data if possible. For extremely rare events (<0.01%), consider anomaly detection approaches instead of supervised classification.
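
Strategies 1 and 3 can be combined in a few lines. The sketch below assumes a 95/5 synthetic dataset; the thresholds 0.5 and 0.3 and the helper `recall_at` are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=15,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Strategy 1: upweight the minority class in the loss
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]

def recall_at(threshold):
    """Minority-class recall when flagging rows with proba >= threshold."""
    preds = (proba >= threshold).astype(int)
    return recall_score(y_val, preds, zero_division=0)

# Strategy 3: lowering the threshold flags more rows as positive
recall_default = recall_at(0.5)
recall_lowered = recall_at(0.3)
print(f"recall @0.5={recall_default:.3f}  @0.3={recall_lowered:.3f}")
```

Lowering the threshold can only keep recall the same or raise it; the cost is more false positives, so precision should be checked at the same threshold. Tune the threshold on the validation set, never on the test set.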

What metrics would you use to evaluate a regression model and what do they mean?

RMSE (Root Mean Square Error): penalises large errors heavily due to squaring — useful when large errors are unacceptable. MAE (Mean Absolute Error): average absolute error — more robust to outliers than RMSE, directly interpretable in the target unit (e.g., "$150 average error"). R² (coefficient of determination): proportion of variance explained by the model (1 = perfect, 0 = no better than predicting the mean). Choose based on business context: if a single large error is catastrophic, optimise RMSE. If all errors matter equally, use MAE. Always report all three in practice.
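
A small worked example with made-up prices shows how one large error affects the three metrics differently:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values: three small errors and one large miss (400 predicted as 500)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 500.0])

mae = mean_absolute_error(y_true, y_pred)            # (10 + 10 + 0 + 100) / 4
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # sqrt(10200 / 4)
r2 = r2_score(y_true, y_pred)                        # 1 - 10200 / 50000

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

MAE is 30.00 while RMSE is about 50.50: the single 100-unit miss dominates the squared error. R² is 0.796, and it can go below 0 when a model does worse than always predicting the mean.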

What is the purpose of a validation gate in an ML CI/CD pipeline?

A validation gate is an automated check that a newly trained model must pass before advancing to the next pipeline stage. It prevents bad models from reaching production. Typical gates: new model F1 must exceed minimum threshold AND must beat the current champion model by at least 1%; prediction latency under 100ms; no significant performance disparity across demographic groups; data quality checks pass. If any gate fails, the pipeline stops and engineers are notified. This is the ML equivalent of a code review blocking a PR merge.
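
The gates listed above could be combined into a single check along these lines. This is a sketch, not a standard API: the function name, the dictionary keys, and the default thresholds are all hypothetical, and "beat the champion by at least 1%" is read here as an absolute F1 gain of 0.01.

```python
from dataclasses import dataclass, field

@dataclass
class GateResult:
    passed: bool
    failures: list = field(default_factory=list)

def validation_gate(candidate, champion_f1, min_f1=0.80, min_gain=0.01,
                    max_latency_ms=100.0, max_group_gap=0.10):
    """Check a candidate's metrics against every gate and collect failures."""
    failures = []
    if candidate["f1"] < min_f1:
        failures.append(f"F1 {candidate['f1']:.3f} below minimum {min_f1}")
    if candidate["f1"] < champion_f1 + min_gain:
        failures.append(f"F1 gain over champion below {min_gain}")
    if candidate["latency_ms"] >= max_latency_ms:
        failures.append(f"latency {candidate['latency_ms']}ms not under {max_latency_ms}ms")
    group_scores = candidate["group_f1"].values()
    if max(group_scores) - min(group_scores) > max_group_gap:
        failures.append(f"F1 gap across groups exceeds {max_group_gap}")
    return GateResult(passed=not failures, failures=failures)

# Illustrative candidates (made-up numbers)
good = {"f1": 0.84, "latency_ms": 42.0, "group_f1": {"a": 0.85, "b": 0.81}}
bad = {"f1": 0.805, "latency_ms": 120.0, "group_f1": {"a": 0.90, "b": 0.70}}

print(validation_gate(good, champion_f1=0.80))
print(validation_gate(bad, champion_f1=0.80))
```

Collecting every failure, rather than stopping at the first, gives engineers the full picture in one notification. In a pipeline, a failed result would typically end the job with a non-zero exit code so that later stages do not run.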

Scenario-based

Your model achieves 96% validation accuracy but only 71% accuracy in production. What went wrong?

This pattern strongly suggests data leakage or training-serving skew. Investigate: 1) Are validation features computed with information unavailable at prediction time? Example: normalisation computed on all data including test, or a feature that encodes information from the future. 2) Is the validation data from the same distribution as production data? If validation was sampled from historical data and production is current real-time data, the distribution may have shifted. 3) Check feature pipelines: are all transformations applied identically in training and serving? Fixes: chronological split, strict feature computation rules, and a shadow deployment that compares model predictions on live traffic with offline performance.

How would you set up a repeatable, automated training pipeline for a model that retrains weekly?

1) Data: automate weekly data extraction, always version the dataset (DVC tag or Azure Data Asset version). Pin the train/val/test split boundaries to dates, not random samples. 2) Training: containerise the training script (Docker image with pinned requirements.txt). Run in Azure ML or Kubernetes. 3) Tracking: every run logs to MLflow with dataset version, params, and metrics. 4) Validation gate: compare new model vs current champion. Only register if gate passes. 5) Deployment: if gate passes, automatically promote to Staging endpoint. A/B test for 48 hours. If production metrics hold, promote to Production. 6) Alerting: if any stage fails, notify the team via Slack or PagerDuty. Full pipeline documented in code, stored in git, and auditable.

You're asked to evaluate a model for a medical diagnosis task. What metrics would you report and why?

For medical diagnosis, missing a real positive (false negative) is usually more dangerous than a false alarm (false positive). Prioritise: 1) Recall/sensitivity — how many of the truly sick patients did we catch? 2) Specificity — how many healthy patients were correctly cleared? 3) ROC-AUC — overall discrimination ability across all thresholds. 4) PR-AUC — especially important if the condition is rare. 5) Calibration — is a "70% confidence" prediction actually correct 70% of the time? Well-calibrated probabilities matter for treatment decisions. Report all these, not just accuracy. Explicitly state at what threshold the model operates and justify it clinically.
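
Sensitivity and specificity at a chosen threshold can be read off the confusion matrix. The labels and scores below are made up for illustration, and the helper `sensitivity_specificity` is hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up example: 4 sick patients (1) and 6 healthy patients (0)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1])

def sensitivity_specificity(threshold):
    """Return (sensitivity, specificity) when flagging scores >= threshold."""
    y_pred = (scores >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)

sens_050, spec_050 = sensitivity_specificity(0.50)
sens_025, spec_025 = sensitivity_specificity(0.25)

print(f"threshold 0.50: sensitivity={sens_050:.2f}  specificity={spec_050:.2f}")
print(f"threshold 0.25: sensitivity={sens_025:.2f}  specificity={spec_025:.2f}")
```

At 0.50 the model catches 3 of 4 sick patients (sensitivity 0.75, specificity 0.83). Lowering the threshold to 0.25 catches all 4 (sensitivity 1.00) but clears only 4 of 6 healthy patients (specificity 0.67), which is the trade-off the operating threshold has to justify clinically.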

📝 Summary

Model quality starts with correctly split data: stratified for imbalanced classes, chronological for time series, and always with an isolated test set. The right metric depends on the business problem, and accuracy is often misleading. Automated validation gates enforce a minimum quality bar before any model reaches production. Track every training run with MLflow, and never touch the test set until the final evaluation.