Intermediate · Lesson 6 of 16

Anomaly Detection in Production Systems

Detect unusual metric patterns, latency drifts, and error rate spikes before they become customer-facing incidents — using statistical and ML-based approaches.

🧒 Simple Explanation (ELI5)

Imagine you commute to work every day and the trip takes between 20-35 minutes depending on traffic. One Tuesday it takes 90 minutes. You immediately know something is wrong — a road closure, an accident, something unusual. Your brain detected an anomaly because it knows what "normal" looks like for your commute.

Anomaly detection gives computers this same capability: learn what "normal" looks like for each metric (CPU, response time, error rate), then automatically flag when something falls outside the expected range — before a human would notice.

🔧 Why Anomaly Detection Outperforms Static Thresholds

🌍 Real-world Analogy

A cardiologist monitors a patient's heart rhythm over weeks. They know that 55-85 BPM at rest is normal for this patient. An unusual rhythm at 2am — even if technically within normal population range — gets flagged because it's unusual for this specific patient. That's per-entity anomaly detection: personalised baselines for each service, not population-wide rules.

⚙️ Anomaly Detection Approaches

1. Statistical Methods (Fast, Explainable)

2. Time-Series Models (Handles Seasonality)

3. ML Models (Complex Multi-variate)

📊 Visual: Anomaly Detection Approaches Comparison

Choosing an Anomaly Detection Method

Statistical: Z-score, IQR
✅ Fast, explainable
✅ No training needed
❌ No seasonality

Time-Series (Prophet): Trend + Seasonality
✅ Daily/weekly patterns
✅ Confidence intervals
❌ Slower to retrain

ML (Isolation Forest): Multi-variate
✅ Correlated signals
✅ No explicit rules
❌ Needs training data

⚡ Kubernetes Integration Flow: Input → AI → Action

How anomaly detection drives automated scaling and alerting in an AKS cluster:

K8s Anomaly Flow: Prometheus Metric → AI Scoring → HPA Scale-out
  1. 📊 Prometheus Scrape: CPU 92% (baseline 45%)
  2. 🔍 Feature Extraction: z-score + IQR + Isolation Forest
  3. 🚨 Anomaly Score 0.97: 2/3 methods agree
  4. 🤖 Auto-Action: HPA trigger / Slack alert
  5. ⚖️ K8s HPA: Replicas 3 → 5
bash
# Step 1: Query Prometheus for recent CPU metrics (last 30 minutes, 1-min resolution).
# query_range expects Unix or RFC 3339 timestamps, so compute them instead of "now-30m".
END=$(date +%s)
START=$((END - 1800))
curl -s 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total{namespace="prod",pod=~"payment-api.*"}[5m]) * 100' \
  --data-urlencode "start=$START" --data-urlencode "end=$END" --data-urlencode 'step=60' \
  | jq '.data.result[0].values' > /tmp/cpu_metrics.json   # result[0] = first matching series only

# Step 2: Run anomaly detector (Python script from code section)
python3 anomaly_detect.py --input /tmp/cpu_metrics.json --output /tmp/anomaly_result.json

ANOMALY=$(jq -r '.is_anomaly' /tmp/anomaly_result.json)
if [ "$ANOMALY" = "true" ]; then
  # Step 3a: Scale out. kubectl scale sets the replica count by hand; if an HPA already
  # manages this deployment it may reset the count, so consider raising its minReplicas instead.
  kubectl scale deployment/payment-api -n prod --replicas=5
  echo "Auto-scaled payment-api to 5 replicas (CPU anomaly detected)"

  # Step 3b: Post the alert to Slack, only when an anomaly was detected.
  # The score and replica counts in the message are hard-coded for illustration.
  curl -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d '{"text":"🚨 CPU anomaly on payment-api (score: 0.97). Auto-scaled 3→5 replicas. Investigate: kubectl top pods -n prod"}'
fi
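Step 2 above calls anomaly_detect.py with --input and --output, but the code section that follows defines a class rather than a command-line script. The sketch below is one hypothetical way to fill that gap, not the lesson's actual script. It assumes the input file holds the `values` array of a Prometheus range query (`[[unix_ts, "value"], ...]`) and that the output only needs an `is_anomaly` field. The function name `score_latest`, the single z-score rule, and the `min_points` default are all assumptions made for illustration.

```python
import argparse
import json
import statistics
import sys


def score_latest(values, threshold=3.0, min_points=10):
    """Score the newest sample against the earlier samples in the window.

    `values` uses the Prometheus range-query shape: [[unix_ts, "value"], ...].
    """
    samples = [float(v) for _, v in values]
    if len(samples) < min_points + 1:
        # Too little history to form a baseline, so report "no anomaly".
        latest = samples[-1] if samples else None
        return {"is_anomaly": False, "z_score": None, "latest": latest}
    *history, latest = samples
    mean = statistics.fmean(history)
    std = max(statistics.stdev(history), 1e-3)  # guard against a flat series
    z = (latest - mean) / std
    return {"is_anomaly": abs(z) > threshold, "z_score": round(z, 2), "latest": latest}


def main(argv):
    parser = argparse.ArgumentParser(description="Hypothetical CLI for Step 2")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    with open(args.input) as fh:
        values = json.load(fh)
    with open(args.output, "w") as fh:
        json.dump(score_latest(values), fh)


# Run as a CLI only when arguments are supplied, so the file stays importable.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1:])
```

Because `json.dump` writes Python's `True` as `true`, the `jq` check in Step 3 can compare the field against the string "true" without further conversion.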

⌨️ Multi-method Anomaly Detection

python
"""
Production-grade anomaly detection combining statistical and ML methods.
Designed for real-time metric monitoring.
"""
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

class ProductionAnomalyDetector:
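    def __init__(self, window_minutes: int = 60):
        # Rolling windows below are counted in rows, so `window_minutes` equals
        # minutes only when the input has one sample per minute.
        self.window = window_minutes
        self.iso_forest = IsolationForest(contamination=0.02, random_state=42)
        self.is_trained = False
        self.feature_cols: list[str] = []
        self.baselines = {}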

    # ── Statistical: Z-score with rolling window ──────────────────────────────
    def z_score_anomaly(self, series: pd.Series, threshold: float = 3.0) -> pd.Series:
        """Flag points more than `threshold` std deviations from rolling mean."""
        rolling_mean = series.rolling(self.window, min_periods=10).mean()
        rolling_std  = series.rolling(self.window, min_periods=10).std()
        z_scores = (series - rolling_mean) / rolling_std.clip(lower=0.001)
        return z_scores.abs() > threshold

    # ── Statistical: IQR method (robust to outliers in training window) ───────
    def iqr_anomaly(self, series: pd.Series, multiplier: float = 2.5) -> pd.Series:
        """Flag points outside [Q1 - m*IQR, Q3 + m*IQR] computed on rolling window."""
        def iqr_check(window_data):
            # rolling().apply() expects a numeric result, so return 1.0 / 0.0
            q1, q3 = np.percentile(window_data, [25, 75])
            iqr = q3 - q1
            lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
            latest = window_data.iloc[-1]
            return float(latest < lower or latest > upper)
        # min_periods=10 matches z_score_anomaly, so both methods warm up together
        flags = series.rolling(self.window, min_periods=10).apply(iqr_check, raw=False)
        return flags.fillna(0.0).astype(bool)

    # ── ML: Isolation Forest on multi-variate features ────────────────────────
    def train_isolation_forest(self, df: pd.DataFrame, feature_cols: list[str]) -> None:
        """Train on recent normal data."""
        train = df[feature_cols].dropna()
        self.iso_forest.fit(train)
        self.feature_cols = feature_cols
        self.is_trained = True
        # Report the row count after dropna(), i.e. what the model actually saw
        print(f"Trained Isolation Forest on {len(train)} samples, {len(feature_cols)} features")

    def ml_anomaly(self, df: pd.DataFrame) -> pd.Series:
        """Return True for rows that Isolation Forest considers anomalous."""
        if not self.is_trained:
            raise RuntimeError("Call train_isolation_forest() first")
        preds = self.iso_forest.predict(df[self.feature_cols].fillna(0))
        return pd.Series(preds == -1, index=df.index)

    # ── Combined: Vote across methods ─────────────────────────────────────────
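    def detect(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run all methods; flag row as anomaly if 2+ methods agree.

        Z-score runs on `cpu_pct`, IQR on `latency_ms`, and Isolation Forest on
        all trained features, so the votes can come from different metrics.
        """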
        df = df.copy()
        df['z_anomaly']   = self.z_score_anomaly(df['cpu_pct'])
        df['iqr_anomaly']  = self.iqr_anomaly(df['latency_ms'])
        if self.is_trained:
            df['ml_anomaly'] = self.ml_anomaly(df)
        else:
            df['ml_anomaly'] = False

        df['anomaly_votes'] = df[['z_anomaly', 'iqr_anomaly', 'ml_anomaly']].sum(axis=1)
        df['is_anomaly']    = df['anomaly_votes'] >= 2   # require 2/3 consensus

        return df

# ── Usage Example ─────────────────────────────────────────────────────────────
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'cpu_pct':    np.random.normal(45, 8, n).clip(5, 100),
    'latency_ms': np.random.normal(120, 20, n).clip(10, 2000),
    'error_rate': np.random.exponential(0.5, n).clip(0, 50),
})
# Inject anomalies at specific points
df.loc[480:485, 'cpu_pct'] = 92
df.loc[480:485, 'latency_ms'] = 850

detector = ProductionAnomalyDetector(window_minutes=60)
detector.train_isolation_forest(df.iloc[:400], ['cpu_pct', 'latency_ms', 'error_rate'])
results = detector.detect(df)

anomaly_rows = results[results['is_anomaly']]
print(f"Anomalies detected: {len(anomaly_rows)}")
print(anomaly_rows[['cpu_pct', 'latency_ms', 'anomaly_votes']].tail(10))

🧪 Hands-on

  1. Run the code above and verify anomalies are detected at rows 480-485 with anomaly_votes >= 2.
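  2. Inject a slow drift anomaly: gradually increase CPU from 45% to 85% between rows 400 and 490. Which method catches it first? (Expect the rolling z-score to lag or miss it, because its mean and standard deviation move with the drift. Isolation Forest, trained on the fixed rows 0-399, is more likely to fire once CPU leaves the range it saw in training.)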
  3. Simulate a Black Friday spike: multiply all values in rows 200-220 by 1.5. This should NOT be flagged as anomalous because all signals spike together. Add a feature check to distinguish coordinated load spikes from individual component failures.
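  4. Export real Prometheus metric data with promtool, which takes the server URL as an argument and Unix or RFC 3339 timestamps for the range, for example promtool query range --start=$(date -d '2 hours ago' +%s) --end=$(date -d '1 hour ago' +%s) --step=1m http://prometheus:9090 'rate(http_requests_total[5m])' (GNU date assumed; adjust the URL to your server), and run your detector against it.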
  5. Tune the contamination parameter from 0.01 to 0.05 and observe how many more anomalies are detected.
💡 Voting Ensemble Pattern

Avoid relying on a single anomaly detection method in production. Different methods excel in different scenarios: z-score catches sudden spikes, IQR handles skewed distributions, and Isolation Forest catches multi-variate patterns. Requiring 2 out of 3 methods to agree cuts false positives, at the cost of missing anomalies that only one method can see, so check recall against known incidents when you tune the vote threshold.

🎮 Try It Yourself

🎮 Challenge: Tune the Detector and Simulate a K8s Incident
  1. Run the multi-method detector from the code section. Verify that anomalies at rows 480–485 are caught with anomaly_votes >= 2. Note which methods voted yes.
  2. Simulate a slow memory leak: Add a gradual drift to latency_ms from row 300 to 490: df.loc[300:490, 'latency_ms'] += np.linspace(0, 200, 191). Run the detector. Does it catch the drift? Which method fires first?
  3. Simulate a false positive from a batch job: Spike ALL metrics at once for rows 200–220 (multiply CPU, latency, and error_rate by 1.5 simultaneously). This should not be flagged as an anomaly if you add a rule: "if all 3 features spike together, it may be a coordinated load event, not a failure." Implement this suppression logic.
  4. Kubernetes HPA integration drill: Imagine your detector output triggers the kubectl scale command above. Write a Python wrapper that: reads the detector output, checks if is_anomaly=True AND cpu_pct > 80, then prints the exact kubectl scale command it would run (don't actually execute it — just print it).
  5. Weekly seasonality test: Add day-of-week to your feature set. Run the detector on data with a deliberate Sunday night valley (multiply all values by 0.4 for timestamps matching Sunday 21:00–23:00). Verify the dip is not flagged as anomalous when a day-specific baseline is used.

🧠 Debugging Scenario

Problem: Your anomaly detector fires every 7 days, always on Sunday evening at ~9pm UTC, even though there's no real incident.

🎯 Interview Questions

Beginner

What is the difference between static threshold alerting and anomaly detection?

Static thresholds use fixed values (CPU > 80% = alert) that don't adapt to context. Anomaly detection learns what "normal" looks like for each service at each time of day and flags deviations from the learned baseline. Static thresholds produce false positives during expected load spikes and miss slow gradual degradation. Anomaly detection adapts dynamically to changing traffic patterns.

What is a z-score and how is it used for anomaly detection?

A z-score measures how many standard deviations a value is from the mean: z = (x - mean) / std. For anomaly detection: compute rolling mean and standard deviation over a historical window. If the current value has |z| > 3 (more than 3 standard deviations above or below the recent average), it's likely anomalous. Z-score is fast and explainable but assumes data is roughly normally distributed.

Why is it important to detect anomalies before an SLO breaches?

SLO breaches mean users are already experiencing degradation. Early anomaly detection catches the leading indicators — rising latency, increasing error rates, memory leak trends — before they cross SLO thresholds. This gives engineers time to investigate and potentially prevent the breach. By the time an SLO alert fires, damage is already done.

What is "seasonal" anomaly detection?

Seasonal anomaly detection accounts for predictable cyclical patterns in metrics — daily traffic cycles (high daytime, low overnight), weekly patterns (lower on weekends), and annual patterns (Black Friday). A metric that is 60% CPU at 9am Monday is normal; 60% at 3am Sunday may be anomalous. Seasonal models compute expected values for each time slot and flag deviations from the time-specific baseline.
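The time-slot idea can be sketched with pandas. This is a minimal illustration, assuming hourly samples and an hour-of-week slot (168 slots); the helper names `hour_of_week`, `fit_slot_baseline`, and `seasonal_flags` are hypothetical, and the data is synthetic with a fixed per-week offset rather than real traffic.

```python
import numpy as np
import pandas as pd


def hour_of_week(index: pd.DatetimeIndex) -> np.ndarray:
    """Map each timestamp to a slot 0..167 (Monday 00:00 is slot 0)."""
    return np.asarray(index.dayofweek * 24 + index.hour)


def fit_slot_baseline(history: pd.Series) -> pd.DataFrame:
    """Mean and std of the metric for every hour-of-week slot."""
    return history.groupby(hour_of_week(history.index)).agg(["mean", "std"])


def seasonal_flags(series: pd.Series, baseline: pd.DataFrame,
                   threshold: float = 3.0) -> pd.Series:
    """Flag points more than `threshold` stds from their own slot's mean."""
    slots = hour_of_week(series.index)
    mean = baseline["mean"].reindex(slots).to_numpy()
    std = baseline["std"].reindex(slots).clip(lower=1e-3).to_numpy()
    z = np.abs(series.to_numpy() - mean) / std
    return pd.Series(z > threshold, index=series.index)


def synthetic_week_pattern(index: pd.DatetimeIndex) -> np.ndarray:
    """Toy CPU shape: 60% during 08:00-19:59, 20% otherwise."""
    busy = (index.hour >= 8) & (index.hour < 20)
    return np.where(busy, 60.0, 20.0)


# Four weeks of hourly history starting Monday 2024-01-01, with a fixed
# per-week offset so every slot has a non-zero spread.
hist_idx = pd.date_range("2024-01-01", periods=24 * 7 * 4, freq="h")
week_offset = np.repeat([-2.0, -1.0, 1.0, 2.0], 24 * 7)
history = pd.Series(synthetic_week_pattern(hist_idx) + week_offset, index=hist_idx)
baseline = fit_slot_baseline(history)

# One new week that follows the pattern, except Sunday 03:00 runs at 60%.
new_idx = pd.date_range("2024-01-29", periods=24 * 7, freq="h")
new_week = pd.Series(synthetic_week_pattern(new_idx), index=new_idx)
odd_hour = pd.Timestamp("2024-02-04 03:00")
new_week.loc[odd_hour] = 60.0

flags = seasonal_flags(new_week, baseline)
print(flags[flags].index.tolist())   # hits from the slot-aware check
print(int((new_week > 80).sum()))    # hits from a static "CPU > 80%" rule
```

In this toy data the 60% reading at Sunday 03:00 is flagged even though the same value is normal at Monday 09:00, and a static 80% threshold never fires. Fitting the baseline on history that excludes the points being scored keeps an anomaly from inflating its own slot's spread.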

What is the contamination parameter in Isolation Forest?

Contamination is the expected fraction of anomalous data points in the training set — it sets the decision threshold. Set it to match your expected anomaly rate in production. Too low (0.001): model misses real anomalies. Too high (0.1): model flags normal points as anomalous. Typical production value is 0.01-0.03 (1-3% anomaly rate). Validate by running the trained model against a labeled dataset.
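To see how contamination moves the decision threshold, the following sketch fits scikit-learn's IsolationForest on synthetic two-feature data and reports the fraction of training points flagged at several settings. The data shape (CPU around 45, latency around 120) mirrors the usage example earlier in this lesson; the helper name `flagged_fraction` is an assumption.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic "normal" data: columns are cpu_pct and latency_ms.
rng = np.random.default_rng(42)
X = rng.normal(loc=[45, 120], scale=[8, 20], size=(1000, 2))


def flagged_fraction(contamination: float) -> float:
    """Fit on X and return the share of X that the model labels -1 (anomalous)."""
    model = IsolationForest(contamination=contamination, random_state=42).fit(X)
    return float((model.predict(X) == -1).mean())


for c in (0.01, 0.03, 0.10):
    print(f"contamination={c:.2f} -> flagged {flagged_fraction(c):.3f}")
```

On the training data the flagged share tracks the contamination value closely, because the parameter picks the score cutoff rather than changing how the trees are built. That is why it should reflect the anomaly rate you expect, not the rate you hope for.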

Intermediate

How do you handle concept drift in a production anomaly detector?

Concept drift occurs when normal behaviour changes (new feature ships, traffic scales up). Detect it by monitoring: 1) Model anomaly rate: if more than 5% of recent points are flagged as anomalous by a well-tuned model, normal may have changed. 2) Feature distribution drift via the Population Stability Index (PSI), which compares the binned distribution of recent data against the baseline. Strategy: use sliding window retraining (retrain weekly on last 30 days of data). Use changepoint detection (e.g., PELT algorithm) to detect when the baseline shifts and trigger automatic retraining.
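The distribution-drift check can be sketched with NumPy. This is a minimal PSI implementation under stated assumptions: bins are the baseline's deciles, empty bins are clipped to a small epsilon, and the 0.1 / 0.25 cutoffs in the comments are a widely quoted rule of thumb rather than anything this lesson prescribes.

```python
import numpy as np


def population_stability_index(baseline, recent, bins: int = 10) -> float:
    """PSI between two samples, using quantile bins taken from `baseline`."""
    baseline = np.asarray(baseline, dtype=float)
    recent = np.asarray(recent, dtype=float)
    # Interior bin edges only; values beyond them fall into the first/last bin.
    interior = np.quantile(baseline, np.linspace(0, 1, bins + 1))[1:-1]
    expected = np.bincount(np.searchsorted(interior, baseline), minlength=bins) / len(baseline)
    actual = np.bincount(np.searchsorted(interior, recent), minlength=bins) / len(recent)
    eps = 1e-6  # avoid log(0) when a bin is empty
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))


rng = np.random.default_rng(0)
last_month = rng.normal(45, 8, 5000)   # baseline CPU distribution
same_regime = rng.normal(45, 8, 5000)  # this week, no change
new_regime = rng.normal(60, 8, 5000)   # this week, after traffic scaled up

# Rule of thumb (assumption): < 0.1 stable, 0.1-0.25 watch, > 0.25 retrain.
print(round(population_stability_index(last_month, same_regime), 4))
print(round(population_stability_index(last_month, new_regime), 4))
```

A PSI that stays high for several consecutive windows is a reasonable trigger for the retraining step described above, since a single high reading may just be an incident in progress.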

What is the difference between anomaly detection and outlier detection?

In practice they're often used interchangeably, but technically: outlier detection identifies individual data points that differ from the population (one extreme CPU reading). Anomaly detection considers temporal context — a sequence of readings that are individually normal but together represent an unusual pattern (gradual memory leak across 2 hours). For AIOps, temporal anomaly detection is more useful because most production degradation patterns develop over time, not as single point spikes.

How do you prevent false positives from expected events like scheduled batch jobs?

Maintain a maintenance/event calendar that the anomaly detector consults before firing. Integrate with your deployment system (GitHub Actions webhook) and change management calendar. When a batch job runs 2am-4am every night, add a suppression window for those time slots for relevant metrics (CPU, disk I/O). Build a "known events" API in your monitoring stack that anomaly systems check before alerting. This dramatically reduces Sunday-night false positives and deployment-correlated alerts.
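A suppression window check might look like the sketch below. It assumes a hand-written list of windows instead of a real calendar or "known events" API, and the `SuppressionWindow` and `should_alert` names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class SuppressionWindow:
    """A recurring daily time range during which some metrics are expected to spike."""
    name: str
    start: time
    end: time
    metrics: frozenset

    def covers(self, ts: datetime, metric: str) -> bool:
        if metric not in self.metrics:
            return False
        t = ts.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        return t >= self.start or t < self.end  # window wraps past midnight


def should_alert(ts: datetime, metric: str, is_anomaly: bool, windows) -> bool:
    """Alert only for anomalies that no suppression window explains."""
    return is_anomaly and not any(w.covers(ts, metric) for w in windows)


# Example: the nightly 02:00-04:00 batch job affects CPU and disk I/O only.
nightly_batch = SuppressionWindow(
    name="nightly-batch",
    start=time(2, 0),
    end=time(4, 0),
    metrics=frozenset({"cpu_pct", "disk_io"}),
)

print(should_alert(datetime(2024, 1, 7, 3, 0), "cpu_pct", True, [nightly_batch]))     # suppressed
print(should_alert(datetime(2024, 1, 7, 3, 0), "error_rate", True, [nightly_batch]))  # still alerts
```

Scoping each window to specific metrics matters: an error-rate anomaly during the batch job is still worth an alert, because the job explains high CPU but not failing requests.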

Scenario-based

Your anomaly detector fires on every large marketing campaign launch but these aren't incidents. How do you fix this without missing real anomalies?

1) Integrate with your campaign calendar: before launch day, ingest the expected traffic multiplier. 2) Widen the baseline bounds during campaign windows (upper bound = mean + 3 × std × traffic_multiplier). 3) Use rate-of-change features: a campaign causes a gradual build-up (expected); an incident often causes a sudden spike within minutes. Detect shape of spike, not just magnitude. 4) After 3-4 campaign events, retrain the model with campaign data labeled "normal" so it learns the pattern.
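Points 2 and 3 can be sketched in a few lines of NumPy. The bound follows the formula as the answer writes it; the function names and the 25% per-step cutoff in `classify_rise` are assumptions chosen for the example, and the two series are synthetic.

```python
import numpy as np


def campaign_upper_bound(mean, std, traffic_multiplier=1.0, k=3.0):
    """Upper bound as written in the answer: mean + k * std * multiplier."""
    return mean + k * std * traffic_multiplier


def peak_step_change(values) -> float:
    """Largest one-step relative increase in the series (0.25 means +25%)."""
    v = np.asarray(values, dtype=float)
    return float(np.max(np.diff(v) / np.maximum(v[:-1], 1e-9)))


def classify_rise(values, max_step: float = 0.25) -> str:
    """Label a rise by its shape; `max_step` is an assumed tuning knob."""
    return "sudden" if peak_step_change(values) > max_step else "gradual"


# Both series go from 100 to 300 requests/s within an hour of 1-minute samples.
campaign = np.linspace(100, 300, 61)                       # steady ramp
incident = np.r_[np.full(30, 100.0), np.full(31, 300.0)]   # jump within one minute

print(campaign_upper_bound(mean=100, std=10, traffic_multiplier=2.0))
print(classify_rise(campaign), classify_rise(incident))
```

Both series reach the same magnitude, so a magnitude-only rule cannot separate them; the largest single-step change does. In practice the cutoff would need tuning per metric, since some healthy services are naturally bursty.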

Two services show CPU anomalies at the same time. How do you determine correlation vs coincidence?

Check service dependency graph: do these services share infrastructure, database, or network path? Compute cross-correlation on their anomaly windows — do both spike within the same 5-minute window? If service A's anomaly precedes service B's by 2-3 minutes, and A calls B, then A's failure likely caused B's — it's a cascading failure, not coincidence. Use distributed tracing to confirm. Alert as a single grouped incident, not two separate ones.
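The lead/lag check can be sketched as a lagged Pearson correlation. This is an illustrative helper (`best_lag` is a hypothetical name), assuming both series are sampled at the same 1-minute resolution and aligned in time; the spike data is synthetic.

```python
import numpy as np


def best_lag(a, b, max_lag: int = 10):
    """Return (lag, corr) with the highest correlation.

    A positive lag means `a` leads `b` by that many samples.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[: len(a) - lag], b[lag:]
        else:
            x, y = a[-lag:], b[: len(b) + lag]
        if x.std() == 0 or y.std() == 0:
            continue  # correlation is undefined for a flat slice
        r = float(np.corrcoef(x, y)[0, 1])
        if r > best[1]:
            best = (lag, r)
    return best


# Service A spikes at minutes 50-59; service B shows the same spike 3 minutes later.
service_a = np.full(120, 45.0)
service_a[50:60] = 90.0
service_b = np.full(120, 45.0)
service_b[53:63] = 90.0

lag, corr = best_lag(service_a, service_b)
print(lag, round(corr, 3))
```

A strong correlation at a positive lag is consistent with A's anomaly preceding B's, but correlation alone does not establish cause. It narrows the search before confirming with the dependency graph and distributed traces, as described above.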

How would you explain to a non-technical stakeholder why the AI model sometimes "misses" anomalies?

Frame it as a "new threat" problem, like a spam filter: it catches 98% of spam it has seen before, but a brand-new spam technique sneaks through until the filter is retrained. Similarly, our anomaly model learns from historical incidents — if a new type of failure hasn't happened before, the model won't recognise it as anomalous. We treat these as opportunities: when we catch a novel issue manually, we add it to the training data so the model learns to catch it next time. Over months, the model gets smarter about our specific system.

🌐 Real-world Usage

Azure Monitor uses per-resource dynamic thresholds — it learns a 4-week baseline for each metric on each specific resource, accounting for time-of-day and day-of-week patterns. Netflix uses a multi-layer anomaly detection stack: statistical models for real-time detection (low latency), ML models for seasonal pattern anomalies (hourly retrain), and LLM-based analysis for complex correlated failures. Dynatrace Davis AI uses causal AI to not just detect anomalies but automatically determine which anomaly caused which downstream impact.

📝 Summary

Effective production anomaly detection combines multiple approaches: statistical methods for speed and explainability, time-series models for seasonality awareness, and ML models for multi-variate correlation. Always use ensemble voting (2+ methods) and integrate a maintenance calendar to suppress expected spikes. The goal isn't zero false positives — it's raising the signal-to-noise ratio so engineers trust and act on alerts.