Anomaly Detection in Production Systems
Detect unusual metric patterns, latency drifts, and error rate spikes before they become customer-facing incidents — using statistical and ML-based approaches.
🧒 Simple Explanation (ELI5)
Imagine you commute to work every day and the trip takes between 20-35 minutes depending on traffic. One Tuesday it takes 90 minutes. You immediately know something is wrong — a road closure, an accident, something unusual. Your brain detected an anomaly because it knows what "normal" looks like for your commute.
Anomaly detection gives computers this same capability: learn what "normal" looks like for each metric (CPU, response time, error rate), then automatically flag when something falls outside the expected range — before a human would notice.
🔧 Why Anomaly Detection Outperforms Static Thresholds
- Static thresholds are blind to context: CPU > 80% is an alert — but if your batch job runs every night from 2-4am, that's normal. A static threshold pages on-call at 2am unnecessarily.
- Patterns change: A new feature ships. Traffic increases 3x. Now your old 80% CPU threshold fires every deployment. You're constantly chasing threshold adjustments instead of real problems.
- Slow degradation is missed: If your database response time drifts from 20ms to 350ms over 2 weeks, no single minute ever crosses a static threshold set above that range, but a detector that compares recent values against a learned baseline flags the gradual drift.
- Different services have different baselines: Your search API and your batch data processor have completely different normal CPU profiles. Static thresholds can't handle this at scale.
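The slow-degradation point can be illustrated with a minimal sketch. The latency series, the 500 ms static threshold, and the function names below are hypothetical, chosen only to show the contrast: a fixed threshold never fires, while a comparison of the recent window against a baseline learned from known-good data does.

```python
import statistics

def static_alerts(values, threshold):
    """Indices where a fixed threshold is crossed."""
    return [i for i, v in enumerate(values) if v > threshold]

def baseline_alerts(values, baseline_len, window=10, n_sigmas=3.0):
    """Indices where the mean of the last `window` samples sits more than
    `n_sigmas` baseline standard deviations above the baseline mean."""
    baseline = values[:baseline_len]
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    alerts = []
    for end in range(baseline_len + window, len(values) + 1):
        recent_mean = statistics.fmean(values[end - window:end])
        if recent_mean > mu + n_sigmas * sigma:
            alerts.append(end - 1)
    return alerts

# Hypothetical data: a stable period (20-24 ms), then a linear drift up to 350 ms.
stable = [20 + (i % 5) for i in range(100)]
drift = [20 + 330 * i / 199 for i in range(200)]
latency = stable + drift

print(static_alerts(latency, threshold=500))   # [] (the drift never reaches 500 ms)
drift_alerts = baseline_alerts(latency, baseline_len=100)
print("first baseline alert at index", drift_alerts[0],
      "latency", round(latency[drift_alerts[0]], 1), "ms")
```

The baseline here is frozen on purpose. A baseline that keeps rolling forward would slowly absorb the drift, which is why drift detection usually compares against an older reference period.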
🌍 Real-world Analogy
A cardiologist monitors a patient's heart rhythm over weeks. They know that 55-85 BPM at rest is normal for this patient. An unusual rhythm at 2am — even if technically within normal population range — gets flagged because it's unusual for this specific patient. That's per-entity anomaly detection: personalised baselines for each service, not population-wide rules.
⚙️ Anomaly Detection Approaches
1. Statistical Methods (Fast, Explainable)
- Z-score: Flag values more than N standard deviations from the rolling mean. Fast and explainable, but assumes normal distribution.
- IQR (Interquartile Range): More robust than z-score for skewed data. Flag values outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR].
- Rolling standard deviation: Compute moving window mean and std. Alert when current value > mean + 3×std.
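As a rough illustration of why IQR is described as more robust than z-score, the sketch below applies both to a small hypothetical latency sample containing two large outliers. It uses a global (non-rolling) mean and standard deviation for brevity; the sample values and function names are assumptions.

```python
import statistics

def zscore_flags(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [abs(v - mu) / sigma > threshold for v in values]

def iqr_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]

# Hypothetical sample: eight normal readings and two extreme ones.
latency_ms = [100, 102, 98, 101, 99, 103, 97, 100, 400, 420]

print(zscore_flags(latency_ms))  # the outliers inflate the std dev and mask themselves
print(iqr_flags(latency_ms))     # quartile-based fences still isolate the last two values
```

In this sample the two extreme readings pull the mean and standard deviation up so far that neither exceeds 3 standard deviations, while the quartile-based fences still catch both.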
2. Time-Series Models (Handles Seasonality)
- Prophet (Facebook): Decomposes the time series into trend + seasonality + holidays. Detects anomalies as values outside predicted confidence interval. Excellent for metrics with daily/weekly patterns (API traffic).
- ARIMA: Classic time-series forecasting. Predicts next values based on past patterns. Flag when actual exceeds predicted by N standard errors.
- Seasonal Decomposition: Decompose metric into seasonal, trend, and residual components. Anomalies appear in the residual component.
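The idea behind seasonal decomposition can be sketched without any forecasting library. The example below is a deliberately simplified stand-in, not Prophet, ARIMA, or STL: it estimates the seasonal component as the per-slot median (for example, one slot per hour of day), subtracts it, and flags large residuals using a robust spread estimate. The traffic series and all names are hypothetical.

```python
import math
import statistics

def seasonal_residual_anomalies(values, period, n_sigmas=4.0):
    """Subtract a per-slot median profile, then flag large residuals."""
    # Seasonal component: median of all observations that share a slot.
    profile = [statistics.median(values[s::period]) for s in range(period)]
    residuals = [v - profile[i % period] for i, v in enumerate(values)]
    # Robust spread: median absolute deviation, scaled to approximate a std dev.
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad or 1e-9
    return [i for i, r in enumerate(residuals) if abs(r - center) > n_sigmas * scale]

# Hypothetical hourly traffic for 7 days: a daily sine cycle plus small deterministic noise.
period = 24
traffic = [
    100 + 50 * math.sin(2 * math.pi * (i % period) / period) + ((i * 7) % 5 - 2)
    for i in range(7 * period)
]
traffic[100] += 60  # inject a single spike

print(seasonal_residual_anomalies(traffic, period))  # -> [100]
```

The regular daily peaks swing by about 100 units but are not flagged, because they are explained by the seasonal profile; only the injected spike remains in the residual.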
3. ML Models (Complex Multi-variate)
- Isolation Forest: Scores data points by how easily they can be isolated. Effective for multi-variate anomalies (CPU + memory + latency together).
- LSTM Autoencoder: Learns to reconstruct normal patterns. High reconstruction error = anomaly. Good for correlated time-series across services.
- One-Class SVM: Learns the boundary of normal data. Flags anything outside the boundary. Requires careful hyperparameter tuning.
📊 Visual: Anomaly Detection Approaches Comparison
⚡ Kubernetes Integration Flow: Input → AI → Action
How anomaly detection drives automated scaling and alerting in an AKS cluster:
CPU 92% (baseline 45%) → z-score + IQR + Isolation Forest → 2/3 methods agree → HPA trigger / Slack alert → Replicas: 3 → 5
# Step 1: Query Prometheus for recent CPU metrics (last 30 minutes, 1-min resolution).
# The query_range API expects RFC 3339 or Unix timestamps, so compute them first (GNU date syntax).
START=$(date -u -d '30 minutes ago' +%s)
END=$(date -u +%s)
curl -s 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total{namespace="prod",pod=~"payment-api.*"}[5m]) * 100' \
  --data-urlencode "start=$START" --data-urlencode "end=$END" --data-urlencode 'step=60' \
  | jq '.data.result[0].values' > /tmp/cpu_metrics.json

# Step 2: Run anomaly detector (Python script from code section)
python3 anomaly_detect.py --input /tmp/cpu_metrics.json --output /tmp/anomaly_result.json

# Step 3: If an anomaly was detected, scale out and post an alert to Slack.
# Note: kubectl scale sets the replica count directly; if an HPA manages this
# deployment, the HPA may later override the value.
ANOMALY=$(jq '.is_anomaly' /tmp/anomaly_result.json)
if [ "$ANOMALY" = "true" ]; then
  kubectl scale deployment/payment-api -n prod --replicas=5
  echo "Auto-scaled payment-api to 5 replicas (CPU anomaly detected)"
  curl -X POST "$SLACK_WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d '{"text":"🚨 CPU anomaly on payment-api (score: 0.97). Auto-scaled 3→5 replicas. Investigate: kubectl top pods -n prod"}'
fi
⌨️ Multi-method Anomaly Detection
"""
Production-grade anomaly detection combining statistical and ML methods.
Designed for real-time metric monitoring.
"""
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


class ProductionAnomalyDetector:
    def __init__(self, window_minutes: int = 60):
        # Window is measured in samples; with 1-minute resolution that equals minutes.
        self.window = window_minutes
        self.iso_forest = IsolationForest(contamination=0.02, random_state=42)
        self.is_trained = False
        self.feature_cols: list[str] = []
        self.baselines = {}

    # ── Statistical: Z-score with rolling window ──────────────────────────────
    def z_score_anomaly(self, series: pd.Series, threshold: float = 3.0) -> pd.Series:
        """Flag points more than `threshold` std deviations from rolling mean."""
        rolling_mean = series.rolling(self.window, min_periods=10).mean()
        rolling_std = series.rolling(self.window, min_periods=10).std()
        # Clip the std so a flat window does not cause division by zero.
        z_scores = (series - rolling_mean) / rolling_std.clip(lower=0.001)
        return z_scores.abs() > threshold

    # ── Statistical: IQR method (robust to outliers in training window) ───────
    def iqr_anomaly(self, series: pd.Series, multiplier: float = 2.5) -> pd.Series:
        """Flag points outside [Q1 - m*IQR, Q3 + m*IQR] computed on rolling window."""
        def iqr_check(window_data: pd.Series) -> float:
            if len(window_data) < 10:
                return 0.0
            q1, q3 = np.percentile(window_data, [25, 75])
            iqr = q3 - q1
            lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr
            last = window_data.iloc[-1]
            # rolling.apply expects a numeric return value, so return 1.0 / 0.0.
            return float(last < lower or last > upper)

        # min_periods=10 matches the z-score method; without it the first
        # `window - 1` rows are never evaluated.
        flags = series.rolling(self.window, min_periods=10).apply(iqr_check, raw=False)
        return flags.fillna(0).astype(bool)

    # ── ML: Isolation Forest on multi-variate features ────────────────────────
    def train_isolation_forest(self, df: pd.DataFrame, feature_cols: list[str]) -> None:
        """Train on recent normal data."""
        self.iso_forest.fit(df[feature_cols].dropna())
        self.feature_cols = feature_cols
        self.is_trained = True
        print(f"Trained Isolation Forest on {len(df)} samples, {len(feature_cols)} features")

    def ml_anomaly(self, df: pd.DataFrame) -> pd.Series:
        """Return True for rows that Isolation Forest considers anomalous."""
        if not self.is_trained:
            raise RuntimeError("Call train_isolation_forest() first")
        preds = self.iso_forest.predict(df[self.feature_cols].fillna(0))
        return pd.Series(preds == -1, index=df.index)  # -1 marks an outlier

    # ── Combined: Vote across methods ─────────────────────────────────────────
    def detect(self, df: pd.DataFrame) -> pd.DataFrame:
        """Run all methods; flag row as anomaly if 2+ methods agree."""
        df = df.copy()
        df['z_anomaly'] = self.z_score_anomaly(df['cpu_pct'])
        df['iqr_anomaly'] = self.iqr_anomaly(df['latency_ms'])
        if self.is_trained:
            df['ml_anomaly'] = self.ml_anomaly(df)
        else:
            df['ml_anomaly'] = False
        df['anomaly_votes'] = df[['z_anomaly', 'iqr_anomaly', 'ml_anomaly']].sum(axis=1)
        df['is_anomaly'] = df['anomaly_votes'] >= 2  # require 2/3 consensus
        return df


# ── Usage Example ─────────────────────────────────────────────────────────────
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'cpu_pct': np.random.normal(45, 8, n).clip(5, 100),
    'latency_ms': np.random.normal(120, 20, n).clip(10, 2000),
    'error_rate': np.random.exponential(0.5, n).clip(0, 50),
})

# Inject anomalies at rows 480-485 (.loc slicing is inclusive on both ends)
df.loc[480:485, 'cpu_pct'] = 92
df.loc[480:485, 'latency_ms'] = 850

detector = ProductionAnomalyDetector(window_minutes=60)
detector.train_isolation_forest(df.iloc[:400], ['cpu_pct', 'latency_ms', 'error_rate'])
results = detector.detect(df)

anomaly_rows = results[results['is_anomaly']]
print(f"Anomalies detected: {len(anomaly_rows)}")
print(anomaly_rows[['cpu_pct', 'latency_ms', 'anomaly_votes']].tail(10))
🧪 Hands-on
- Run the code above and verify anomalies are detected at rows 480-485 with anomaly_votes >= 2.
- Inject a slow drift anomaly: gradually increase CPU from 45% to 85% between rows 400 and 490. Which method catches it first? (Hint: a rolling window absorbs a slow drift into its own baseline, so compare the rolling z-score against the Isolation Forest, which was trained on the fixed baseline of the first 400 rows.)
- Simulate a Black Friday spike: multiply all values in rows 200-220 by 1.5. This should NOT be flagged as anomalous because all signals spike together. Add a feature check to distinguish coordinated load spikes from individual component failures.
- Export real Prometheus metric data using promtool query range --start=<start> --end=<end> <prometheus-url> 'rate(http_requests_total[5m])' and run your detector against it. The start and end flags take timestamps (RFC 3339 or Unix), for example the window from 2 hours ago to 1 hour ago.
- Tune the contamination parameter from 0.01 to 0.05 and observe how many more anomalies are detected.
Never rely on a single anomaly detection method in production. Different methods excel in different scenarios — z-score catches sudden spikes, IQR handles skewed distributions, Isolation Forest catches multi-variate patterns. Requiring 2 out of 3 methods to agree dramatically reduces false positives while maintaining recall for real incidents.
🎮 Try It Yourself
- Run the multi-method detector from the code section. Verify that anomalies at rows 480–485 are caught with anomaly_votes >= 2. Note which methods voted yes.
- Simulate a slow memory leak: Add a gradual drift to latency_ms from row 300 to 490: df.loc[300:490, 'latency_ms'] += np.linspace(0, 200, 191). Run the detector. Does it catch the drift? Which method fires first?
- Simulate a false positive from a batch job: Spike ALL metrics at once for rows 200–220 (multiply CPU, latency, and error_rate by 1.5 simultaneously). This should not be flagged as an anomaly if you add a rule: "if all 3 features spike together, it may be a coordinated load event, not a failure." Implement this suppression logic.
- Kubernetes HPA integration drill: Imagine your detector output triggers the kubectl scale command above. Write a Python wrapper that: reads the detector output, checks if is_anomaly=True AND cpu_pct > 80, then prints the exact kubectl scale command it would run (don't actually execute it, just print it).
- Weekly seasonality test: Add day-of-week to your feature set. Run the detector on data with a deliberate Sunday night valley (multiply all values by 0.4 for timestamps matching Sunday 21:00–23:00). Verify the dip is not flagged as anomalous when a day-specific baseline is used.
🧠 Debugging Scenario
Problem: Your anomaly detector fires every 7 days, always on Sunday evening at ~9pm UTC, even though there's no real incident.
- Investigation: Examine the metric pattern over 4 weeks. Sunday 9pm shows a consistent low-traffic valley (developers return from weekends, US time zone). The model's rolling window doesn't hold enough history to learn the weekly pattern, so it flags the valley as abnormal.
- Root cause: Statistical anomaly detection with a 60-minute rolling window can't learn weekly seasonality.
- Fix option 1: Switch to Prophet or add day-of-week as a conditioning feature. Compute day-specific baselines (separate baseline for Sunday vs. Wednesday).
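- Fix option 2: Add a minimum duration requirement: an anomaly must persist for 10+ minutes to alert. This suppresses only dips shorter than the threshold, so it filters brief valleys but not one that lasts 30 minutes or longer; for those, raise the duration or combine it with option 1.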
- Fix option 3: Lengthen the rolling window to capture more historical context: increase it from 60min to 10080min (1 week).
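A minimum-duration rule like the one in fix option 2 can be sketched as a small post-processing step over the detector's per-sample flags. The function name and the assumption of one sample per minute are illustrative, not part of the original detector.

```python
def sustained(flags, min_run):
    """Keep a flag only once it has been True for `min_run` consecutive samples."""
    out, run = [], 0
    for flagged in flags:
        run = run + 1 if flagged else 0   # length of the current True streak
        out.append(run >= min_run)
    return out

# With 1-minute samples, min_run=10 would mean "anomalous for 10+ minutes".
raw = [False, True, True, False, True, True, True, True]
print(sustained(raw, min_run=3))
# -> [False, False, False, False, False, False, True, True]
```

The trade-off is detection delay: an alert fires only after the anomaly has already lasted `min_run` samples, so the threshold should stay shorter than the time you can afford to wait on a real incident.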
🎯 Interview Questions
Beginner
Static thresholds use fixed values (CPU > 80% = alert) that don't adapt to context. Anomaly detection learns what "normal" looks like for each service at each time of day and flags deviations from the learned baseline. Static thresholds produce false positives during expected load spikes and miss slow gradual degradation. Anomaly detection adapts dynamically to changing traffic patterns.
A z-score measures how many standard deviations a value is from the mean: z = (x - mean) / std. For anomaly detection: compute rolling mean and standard deviation over a historical window. If the current value has z-score > 3 (more than 3 standard deviations from recent average), it's likely anomalous. Z-score is fast and explainable but assumes data is roughly normally distributed.
SLO breaches mean users are already experiencing degradation. Early anomaly detection catches the leading indicators — rising latency, increasing error rates, memory leak trends — before they cross SLO thresholds. This gives engineers time to investigate and potentially prevent the breach. By the time an SLO alert fires, damage is already done.
Seasonal anomaly detection accounts for predictable cyclical patterns in metrics — daily traffic cycles (high daytime, low overnight), weekly patterns (lower on weekends), and annual patterns (Black Friday). A metric that is 60% CPU at 9am Monday is normal; 60% at 3am Sunday may be anomalous. Seasonal models compute expected values for each time slot and flag deviations from the time-specific baseline.
Contamination is the expected fraction of anomalous data points in the training set — it sets the decision threshold. Set it to match your expected anomaly rate in production. Too low (0.001): model misses real anomalies. Too high (0.1): model flags normal points as anomalous. Typical production value is 0.01-0.03 (1-3% anomaly rate). Validate by running the trained model against a labeled dataset.
Intermediate
Concept drift occurs when normal behaviour changes (new feature ships, traffic scales up). Detect it by monitoring: 1) Model anomaly rate — if > 5% of recent points are flagged as anomalous by a well-tuned model, normal may have changed. 2) Feature distribution drift via PSI. Strategy: use sliding window retraining (retrain weekly on last 30 days of data). Use changepoint detection (e.g., PELT algorithm) to detect when the baseline shifts and trigger automatic retraining.
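The PSI check mentioned above can be sketched in a few lines. This version is a minimal illustration using equal-width bins taken from the reference data; the bin count, the epsilon floor, and the sample data are assumptions. A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift, but the right cut-off depends on the metric.

```python
import math

def psi(expected, actual, bins=10, eps=1e-4):
    """Population Stability Index between a reference sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(data):
        counts = [0] * bins
        for x in data:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)   # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(data), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical CPU samples: last month's baseline vs. a week after traffic grew.
baseline = [i % 50 for i in range(1000)]
shifted = [x + 25 for x in baseline]

print(psi(baseline, baseline))       # 0.0 for identical distributions
print(psi(baseline, shifted) > 0.25) # True: large shift, candidate for retraining
```

A retraining job could compute this per feature on a schedule and trigger a retrain when the value stays above the chosen cut-off.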
In practice they're often used interchangeably, but technically: outlier detection identifies individual data points that differ from the population (one extreme CPU reading). Anomaly detection considers temporal context — a sequence of readings that are individually normal but together represent an unusual pattern (gradual memory leak across 2 hours). For AIOps, temporal anomaly detection is more useful because most production degradation patterns develop over time, not as single point spikes.
Maintain a maintenance/event calendar that the anomaly detector consults before firing. Integrate with your deployment system (GitHub Actions webhook) and change management calendar. When a batch job runs 2am-4am every night, add a suppression window for those time slots for relevant metrics (CPU, disk I/O). Build a "known events" API in your monitoring stack that anomaly systems check before alerting. This dramatically reduces Sunday-night false positives and deployment-correlated alerts.
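A "known events" check of this kind might look like the sketch below. The `SuppressionWindow` type, the `KNOWN_EVENTS` list, and the metric names are hypothetical; a real system would load windows from the deployment pipeline or change calendar rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass(frozen=True)
class SuppressionWindow:
    name: str
    start: time
    end: time
    metrics: frozenset

    def covers(self, ts: datetime, metric: str) -> bool:
        """True if `ts` falls inside this daily window for a covered metric."""
        if metric not in self.metrics:
            return False
        t = ts.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        return t >= self.start or t < self.end   # window wraps past midnight

# Hypothetical calendar entry: nightly batch job from 02:00 to 04:00.
KNOWN_EVENTS = [
    SuppressionWindow("nightly-batch", time(2, 0), time(4, 0),
                      frozenset({"cpu_pct", "disk_io"})),
]

def should_alert(ts, metric, is_anomaly, windows=KNOWN_EVENTS):
    """Alert only for anomalies that no known event explains."""
    if not is_anomaly:
        return False
    return not any(w.covers(ts, metric) for w in windows)

print(should_alert(datetime(2024, 1, 1, 2, 30), "cpu_pct", True))     # False: suppressed
print(should_alert(datetime(2024, 1, 1, 2, 30), "latency_ms", True))  # True: metric not covered
print(should_alert(datetime(2024, 1, 1, 5, 0), "cpu_pct", True))      # True: outside window
```

Scoping each window to specific metrics matters: a latency anomaly during the batch window still pages, because the batch job is only expected to affect CPU and disk I/O.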
Scenario-based
1) Integrate with your campaign calendar — before launch day, ingest the expected traffic multiplier. 2) Widen the baseline bounds during campaign windows (mean + 3×std × traffic_multiplier). 3) Use rate-of-change features: a campaign causes a gradual build-up (expected); an incident often causes a sudden spike within minutes. Detect shape of spike, not just magnitude. 4) After 3-4 campaign events, retrain the model with campaign data labeled "normal" so it learns the pattern.
Check service dependency graph: do these services share infrastructure, database, or network path? Compute cross-correlation on their anomaly windows — do both spike within the same 5-minute window? If service A's anomaly precedes service B's by 2-3 minutes, and A calls B, then A's failure likely caused B's — it's a cascading failure, not coincidence. Use distributed tracing to confirm. Alert as a single grouped incident, not two separate ones.
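The lead/lag check described above can be approximated by sliding one service's anomaly flags against the other's and counting overlaps. This is a simplified stand-in for a full cross-correlation, and the function name and sample flags are hypothetical.

```python
def best_lag(a_flags, b_flags, max_lag):
    """Return (lag, overlap) where shifting A forward by `lag` samples best
    aligns its anomalies with B's. A positive lag means A leads B."""
    best = (0, -1)
    for lag in range(-max_lag, max_lag + 1):
        overlap = sum(
            1
            for i, a in enumerate(a_flags)
            if a and 0 <= i + lag < len(b_flags) and b_flags[i + lag]
        )
        if overlap > best[1]:
            best = (lag, overlap)
    return best

# Hypothetical 1-minute anomaly flags: service A is anomalous at minutes 3-4,
# service B at minutes 5-6.
a = [False] * 10
b = [False] * 10
a[3] = a[4] = True
b[5] = b[6] = True

print(best_lag(a, b, max_lag=5))  # -> (2, 2): A's anomalies precede B's by 2 minutes
```

A consistent positive lag between a caller and its dependency supports the cascading-failure hypothesis, but it is only a hint; the dependency graph and traces are still needed to confirm direction.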
Frame it as a "new threat" problem, like a spam filter: it catches 98% of spam it has seen before, but a brand-new spam technique sneaks through until the filter is retrained. Similarly, our anomaly model learns from historical incidents — if a new type of failure hasn't happened before, the model won't recognise it as anomalous. We treat these as opportunities: when we catch a novel issue manually, we add it to the training data so the model learns to catch it next time. Over months, the model gets smarter about our specific system.
🌐 Real-world Usage
Azure Monitor uses per-resource dynamic thresholds — it learns a 4-week baseline for each metric on each specific resource, accounting for time-of-day and day-of-week patterns. Netflix uses a multi-layer anomaly detection stack: statistical models for real-time detection (low latency), ML models for seasonal pattern anomalies (hourly retrain), and LLM-based analysis for complex correlated failures. Dynatrace Davis AI uses causal AI to not just detect anomalies but automatically determine which anomaly caused which downstream impact.
📝 Summary
Effective production anomaly detection combines multiple approaches: statistical methods for speed and explainability, time-series models for seasonality awareness, and ML models for multi-variate correlation. Always use ensemble voting (2+ methods) and integrate a maintenance calendar to suppress expected spikes. The goal isn't zero false positives — it's raising the signal-to-noise ratio so engineers trust and act on alerts.