Lab: Anomaly Detection with Azure Monitor
Build a practical anomaly workflow using Azure Monitor data, baselines, and enriched alerts for faster incident detection.
🧒 Simple Explanation (ELI5)
This lab teaches you to spot unusual behavior automatically instead of waiting for a metric to cross one fixed threshold.
🤔 Why Do We Need It?
- Static thresholds miss slow drifts and time-of-day patterns.
- Teams need early warning before users complain.
- Azure Monitor telemetry already contains signals that can power anomaly detection.
🌍 Real-world Analogy
If your home electricity usage is always higher on weekends, a smart system learns that pattern. It alerts only when usage is unusual even for a typical weekend.
⚙️ Technical Explanation
The lab uses a baseline window plus current metric values. The detector compares current values against recent normal behavior and emits an enriched alert with deviation, likely impact, and supporting context.
📊 Visual Representation
⌨️ Commands / Syntax
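InsightsMetrics
| where Namespace == "container.azm.ms/kubestate"
| summarize avg(Val) by bin(TimeGenerated, 5m), Name

def is_anomalous(current, baseline_mean, baseline_std):
    # Upper band: three standard deviations above the baseline mean.
    upper = baseline_mean + (3 * baseline_std)
    # Return the verdict together with the band so the alert can report both.
    return current > upper, upper
🧪 Hands-on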
- Choose a metric such as request latency, CPU, or queue length.
- Collect 7 days of historical values in 5-minute buckets.
- Compute a simple moving baseline and deviation band.
- Run the detector against current data.
- Create an alert payload that includes baseline, current value, and percentage deviation.
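The steps above can be sketched as a single function. This is a minimal illustration, not the lab's reference implementation: the function name `build_alert`, the payload field names, and the sample latency values are all hypothetical, and the history list stands in for whatever window of 5-minute buckets you exported from Azure Monitor. To get a moving baseline, call it with the most recent window each time a new bucket arrives.

```python
from statistics import mean, stdev

def build_alert(metric, history, current, sigmas=3.0):
    """Return an enriched alert dict if `current` exceeds the upper band, else None.

    `history` is the baseline window (a list of recent bucket values).
    Only upward deviations are checked, matching the lab's is_anomalous().
    """
    baseline_mean = mean(history)
    baseline_std = stdev(history)  # sample standard deviation; needs >= 2 points
    upper = baseline_mean + sigmas * baseline_std
    if current <= upper:
        return None
    # Percentage deviation is undefined for a zero baseline, so report None then.
    deviation_pct = None
    if baseline_mean != 0:
        deviation_pct = round((current - baseline_mean) / baseline_mean * 100, 1)
    return {
        "metric": metric,
        "baseline_mean": baseline_mean,
        "upper_band": upper,
        "current": current,
        "deviation_pct": deviation_pct,
    }

# Hypothetical latency samples in milliseconds (mean 90, sample std 2).
history = [88, 92, 90, 91, 89, 90, 93, 87]
alert = build_alert("request_latency_ms", history, 280)
quiet = build_alert("request_latency_ms", history, 95)
```

With these sample values the upper band is 96 ms, so a reading of 280 ms produces a payload with a deviation of about 211%, while a reading of 95 ms produces no alert.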
🧭 Example (Real-world Use Case)
An internal API usually sees low traffic overnight. At 02:30, latency rises from 90ms to 280ms. A static 500ms threshold never fires, but the anomaly model catches the abnormal jump and triggers early investigation.
🛠️ Try It Yourself
- Try the same detector on a spiky metric and see where it breaks.
- Add deployment timestamps and observe whether they help explain anomalies.
- Experiment with 2-sigma vs 3-sigma detection.
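For the 2-sigma versus 3-sigma experiment, one simple approach is to count how many current points each band would flag. The helper `count_flags` and the sample values below are hypothetical; substitute your own exported buckets.

```python
from statistics import mean, stdev

def count_flags(baseline, current_values, sigmas):
    """Count how many current values exceed mean + sigmas * std of the baseline."""
    upper = mean(baseline) + sigmas * stdev(baseline)
    return sum(1 for value in current_values if value > upper)

# Hypothetical samples: the baseline has mean 90 and sample std 2.
baseline = [88, 92, 90, 91, 89, 90, 93, 87]
current = [91, 95, 97, 94, 99, 90]

flags_2_sigma = count_flags(baseline, current, 2)  # band at 94
flags_3_sigma = count_flags(baseline, current, 3)  # band at 96
```

On this data the 2-sigma band flags three points and the 3-sigma band flags two, which illustrates the sensitivity versus noise trade-off on a small scale.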
🐛 Debugging Scenario
Problem: You get anomalies every morning when traffic ramps up.
- Check: whether the baseline models hourly or daily seasonality.
- Fix: compare current values to similar time windows from previous days instead of one global baseline.
- Fix: add minimum anomaly duration to prevent one-bucket spikes from paging.
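The two fixes above can be combined in a short sketch. It assumes, for simplicity, that seasonality is modeled per hour of day and that each point is a `(hour, value)` pair; the function names, the `min_buckets` parameter, and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def seasonal_baseline(history):
    """Group (hour, value) pairs by hour and return {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    # stdev needs at least two samples, so skip hours with less history.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def sustained_anomaly(points, baseline, sigmas=3.0, min_buckets=3):
    """True if min_buckets consecutive points exceed the band for their own hour."""
    run = 0
    for hour, value in points:
        if hour not in baseline:
            run = 0  # no baseline for this hour: do not count it either way
            continue
        hour_mean, hour_std = baseline[hour]
        if value > hour_mean + sigmas * hour_std:
            run += 1
            if run >= min_buckets:
                return True
        else:
            run = 0
    return False

# Hypothetical history: quiet at 03:00, busy at 09:00 on previous days.
history = [(3, 90), (3, 92), (3, 88), (9, 300), (9, 310), (9, 290)]
baseline = seasonal_baseline(history)

morning_ramp = sustained_anomaly([(9, 305), (9, 310), (9, 308)], baseline)
night_incident = sustained_anomaly([(3, 280), (3, 275), (3, 290)], baseline)
single_spike = sustained_anomaly([(3, 280), (3, 91), (3, 90)], baseline)
```

Here the normal morning ramp is not flagged because it is compared with previous mornings, the sustained overnight jump is flagged, and a one-bucket spike is ignored because it does not last for `min_buckets` consecutive points.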
🎯 Interview Questions
Beginner
A baseline is the expected normal behavior of a metric over time.
Static thresholds do not adapt to seasonality, traffic patterns, or service-specific behavior.
Compare the current value against a historical mean and standard deviation or a dynamic band.
Azure Monitor provides the telemetry source used to build baselines and evaluate current behavior.
Enrichment gives responders context such as deviation magnitude and recent changes so they can act faster.
Intermediate
I choose a window that matches the service behavior and the response speed needed, balancing sensitivity and stability.
Because traffic and load often vary by hour, weekday, and business cycle, and ignoring that creates false positives.
Higher sensitivity catches more true issues but also increases false positives, so calibration matters.
Backtest on historical incidents and known normal periods, then compare misses and false alarms.
No. Different services have different traffic shapes, baselines, and business tolerance for noise.
Scenario-based
Because batch jobs follow different execution patterns and need separate baselines and alert logic.
I would add contextual features, minimum duration, and multimetric confirmation rather than simply raising thresholds.
I treat it as calibration feedback and inspect whether the model flagged an interesting but non-actionable pattern.
Yes. Deployment timing often explains behavior shifts and helps the model separate expected changes from suspicious ones.
Start with shadow alerts, tune against history, then phase into advisory and finally paging once the signal is trustworthy.
📝 Summary
This lab turns anomaly detection into a concrete monitoring pattern: baseline, compare, enrich, and alert with context rather than raw threshold crossings.