Hands-on | Lesson 14 of 16

Lab: Anomaly Detection with Azure Monitor

Build a practical anomaly workflow using Azure Monitor data, baselines, and enriched alerts for faster incident detection.

🧒 Simple Explanation (ELI5)

This lab teaches you to spot unusual behavior automatically instead of waiting for a metric to cross one fixed threshold.

🤔 Why Do We Need It?

Static thresholds miss slow drifts and time-of-day patterns.
Teams need early warning before users complain.
Azure Monitor telemetry already contains signals that can power anomaly detection.

🌍 Real-world Analogy

If your home electricity usage is always higher on weekends, a smart system learns that pattern. It alerts only when usage is unusual even for a weekend.

⚙️ Technical Explanation

The lab uses a baseline window plus current metric values. The detector compares current values against recent normal behavior and emits an enriched alert with deviation, likely impact, and supporting context.

📊 Visual Representation

Anomaly Lab Flow

Azure Metrics → Baseline Builder → Detector → Enriched Alert

⌨️ Commands / Syntax

kusto

InsightsMetrics
| where TimeGenerated > ago(7d)  // match the 7-day baseline window used in the lab
| where Namespace == "container.azm.ms/kubestate"
| summarize AvgVal = avg(Val) by bin(TimeGenerated, 5m), Name

python

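def is_anomalous(current, baseline_mean, baseline_std, sigmas=3):
    """Return (flag, upper): flag is True when current exceeds the upper band."""
    # One-sided check: only values above mean + sigmas * std are flagged.
    upper = baseline_mean + (sigmas * baseline_std)
    return current > upper, upper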

🧪 Hands-on

Choose a metric such as request latency, CPU, or queue length.
Collect 7 days of historical values in 5-minute buckets.
Compute a simple moving baseline and deviation band.
Run the detector against current data.
Create an alert payload that includes baseline, current value, and percentage deviation.
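The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's reference solution: the function names `moving_baseline` and `build_alert`, the payload field names, and the sample latency values are all hypothetical. Seven days of 5-minute buckets would give 2,016 values; the short list here only keeps the example readable.

```python
from statistics import mean, stdev

def moving_baseline(values, window):
    """Mean and sample standard deviation of the last `window` buckets."""
    recent = values[-window:]
    return mean(recent), stdev(recent)

def build_alert(metric, current, baseline_mean, baseline_std, sigmas=3):
    """Return an enriched alert dict, or None when the value is inside the band."""
    upper = baseline_mean + sigmas * baseline_std
    if current <= upper:
        return None
    # Percentage deviation is undefined for a zero baseline, so report None.
    deviation_pct = None
    if baseline_mean != 0:
        deviation_pct = round((current - baseline_mean) / baseline_mean * 100, 1)
    return {
        "metric": metric,
        "current": current,
        "baseline_mean": round(baseline_mean, 2),
        "upper_band": round(upper, 2),
        "deviation_pct": deviation_pct,
    }

# Hypothetical history: 5-minute latency buckets in milliseconds.
history = [88, 92, 90, 91, 89, 93, 90, 87, 91, 89, 92, 90]
base_mean, base_std = moving_baseline(history, window=12)

alert = build_alert("request_latency_ms", 280, base_mean, base_std)
quiet = build_alert("request_latency_ms", 91, base_mean, base_std)
print(alert)
print(quiet)
```

Returning `None` for in-band values keeps the caller simple: anything that is not `None` is ready to send to whatever alert channel you use.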

🧭 Example (Real-world Use Case)

An internal API usually sees low traffic overnight. At 02:30, latency rises from 90 ms to 280 ms. A static 500 ms threshold never fires, but the anomaly model catches the abnormal jump and triggers early investigation.
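To make the numbers concrete, here is the same scenario worked through in Python. The 90 ms baseline, 280 ms current value, and 500 ms static threshold come from the example; the 15 ms overnight standard deviation is an assumption added purely for illustration.

```python
STATIC_THRESHOLD_MS = 500  # from the example
baseline_mean_ms = 90      # from the example
baseline_std_ms = 15       # assumed for illustration, not given in the example
current_ms = 280           # from the example

# Static rule: fires only above the fixed threshold.
static_fires = current_ms > STATIC_THRESHOLD_MS

# Baseline rule: fires above mean + 3 standard deviations.
upper_band_ms = baseline_mean_ms + 3 * baseline_std_ms
anomaly_fires = current_ms > upper_band_ms

# How many standard deviations the current value sits above the baseline.
z_score = (current_ms - baseline_mean_ms) / baseline_std_ms

print(f"static threshold fires: {static_fires}")
print(f"upper band: {upper_band_ms} ms, anomaly fires: {anomaly_fires}")
print(f"z-score: {z_score:.1f}")
```

Under that assumed spread, the upper band sits far below the static threshold, which is why the baseline rule reacts to the jump while the fixed rule stays silent.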

🛠️ Try It Yourself

Try the same detector on a spiky metric and see where it breaks.
Add deployment timestamps and observe whether they help explain anomalies.
Experiment with 2-sigma vs 3-sigma detection.
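One possible starting point for the 2-sigma vs 3-sigma experiment is sketched below. The helper name `count_flags` and the queue-length samples are hypothetical. The baseline sample deliberately contains two spikes so you can see how they inflate the standard deviation and widen the band.

```python
from statistics import mean, stdev

def count_flags(baseline, live, sigmas):
    """Count live values above mean + sigmas * std of the baseline sample."""
    upper = mean(baseline) + sigmas * stdev(baseline)
    return sum(1 for value in live if value > upper)

# Hypothetical spiky queue-length samples; 30 and 28 are spikes in the baseline.
baseline = [10, 12, 11, 30, 9, 13, 10, 28, 11, 12]
live = [12, 27, 31, 35, 11, 45]

flags_2_sigma = count_flags(baseline, live, sigmas=2)
flags_3_sigma = count_flags(baseline, live, sigmas=3)
print("2-sigma flags:", flags_2_sigma)
print("3-sigma flags:", flags_3_sigma)
```

Try removing the two spikes from `baseline` and rerunning: the band tightens and both counts change, which is one way a mean-and-deviation detector breaks on spiky data.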

🐛 Debugging Scenario

Problem: You get anomalies every morning when traffic ramps up.

Check: whether the baseline models hourly or daily seasonality.
Fix: compare current values to similar time windows from previous days instead of one global baseline.
Fix: add minimum anomaly duration to prevent one-bucket spikes from paging.
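Both fixes can be sketched together. This is an illustrative outline under simplifying assumptions: baselines are grouped by hour of day only (no weekday split), and the function names `seasonal_baselines`, `above_band`, and `sustained`, along with the sample data, are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev

def seasonal_baselines(samples):
    """Group (timestamp, value) samples by hour of day; return {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for ts, value in samples:
        by_hour[ts.hour].append(value)
    # stdev needs at least two values per hour.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def above_band(value, baseline, sigmas=3):
    """True when value exceeds mean + sigmas * std for the given baseline."""
    base_mean, base_std = baseline
    return value > base_mean + sigmas * base_std

def sustained(flags, min_buckets):
    """True only when the most recent `min_buckets` flags are all True."""
    return len(flags) >= min_buckets and all(flags[-min_buckets:])

# Hypothetical history: 7 days, quiet at 03:00 (~100) and busy at 09:00 (~400).
start = datetime(2024, 1, 1)
samples = []
for day in range(7):
    for hour, level in ((3, 100), (9, 400)):
        ts = start + timedelta(days=day, hours=hour)
        samples.append((ts, level + (day % 3) * 5))

baselines = seasonal_baselines(samples)
morning_value = 410

vs_same_hour = above_band(morning_value, baselines[9])  # compared with past mornings
vs_overnight = above_band(morning_value, baselines[3])  # compared with overnight values
print("flagged against 09:00 baseline:", vs_same_hour)
print("flagged against 03:00 baseline:", vs_overnight)
print("page on one-bucket spike:", sustained([False, False, True], min_buckets=3))
print("page on sustained anomaly:", sustained([True, True, True], min_buckets=3))
```

The morning ramp looks anomalous only when it is judged against overnight behavior. Comparing like with like removes the daily false alarm, and the duration check keeps a single noisy bucket from paging anyone.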

🎯 Interview Questions

Beginner

What is a baseline in anomaly detection?

A baseline is the expected normal behavior of a metric over time.

Why are static thresholds limited?

They do not adapt to seasonality, traffic patterns, or service-specific behavior.

What is a simple way to detect anomalies?

Compare the current value against a historical mean and standard deviation or a dynamic band.

What does Azure Monitor provide in this lab?

It provides the telemetry source used to build baselines and evaluate current behavior.

Why enrich anomaly alerts?

Enrichment gives responders context such as deviation magnitude and recent changes so they can act faster.

Intermediate

How do you choose a detection window?

I choose a window that matches the service behavior and the response speed needed, balancing sensitivity and stability.

Why is seasonality important?

Because traffic and load often vary by hour, weekday, and business cycle, and ignoring that creates false positives.

What is the tradeoff between sensitivity and noise?

Higher sensitivity catches more true issues but also increases false positives, so calibration matters.

How do you validate anomaly models?

Backtest on historical incidents and known normal periods, then compare misses and false alarms.
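A backtest of this kind can be as simple as counting hits, misses, and false alarms per bucket. The sketch below assumes you already have one detector flag and one incident label per bucket; the function name `backtest` and the sample lists are hypothetical.

```python
def backtest(predicted, actual):
    """Compare detector flags with labelled incident buckets, pair by pair."""
    pairs = list(zip(predicted, actual))
    hits = sum(1 for p, a in pairs if p and a)
    false_alarms = sum(1 for p, a in pairs if p and not a)
    misses = sum(1 for p, a in pairs if a and not p)
    # Guard against division by zero when nothing was flagged or nothing happened.
    precision = hits / (hits + false_alarms) if hits + false_alarms else 0.0
    recall = hits / (hits + misses) if hits + misses else 0.0
    return {
        "hits": hits,
        "false_alarms": false_alarms,
        "misses": misses,
        "precision": precision,
        "recall": recall,
    }

# Hypothetical buckets: detector output vs. buckets marked as real incidents.
predicted = [False, True, True, False, True, False]
actual = [False, True, False, False, True, True]

report = backtest(predicted, actual)
print(report)
```

Rerunning the same backtest after each tuning change (window size, sigma multiplier, minimum duration) shows whether a change traded misses for false alarms or improved both.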

Would you use one model for every service?

No. Different services have different traffic shapes, baselines, and business tolerance for noise.

Scenario-based

Your anomaly detector performs well for APIs but badly for batch jobs. Why?

Because batch jobs follow different execution patterns and need separate baselines and alert logic.

How would you reduce false positives without missing real incidents?

I would add contextual features, a minimum anomaly duration, and multi-metric confirmation rather than simply raising thresholds.

An anomaly fires, but users are unaffected. What do you do?

I treat it as calibration feedback and inspect whether the model flagged an interesting but non-actionable pattern.

Would you feed deployment events into the detector?

Yes. Deployment timing often explains behavior shifts and helps the model separate expected changes from suspicious ones.

How do you move this lab toward production?

Start with shadow alerts, tune against history, then phase into advisory and finally paging once the signal is trustworthy.

📝 Summary

This lab turns anomaly detection into a concrete monitoring pattern: baseline, compare, enrich, and alert with context rather than raw threshold crossings.
