Hands-on · Lesson 14 of 16

Lab: Anomaly Detection with Azure Monitor

Build a practical anomaly workflow using Azure Monitor data, baselines, and enriched alerts for faster incident detection.

🧒 Simple Explanation (ELI5)

This lab teaches you to spot unusual behavior automatically instead of waiting for a metric to cross one fixed threshold.

🤔 Why Do We Need It?

🌍 Real-world Analogy

If your home electricity usage is always higher on weekends, a smart system learns that pattern. It alerts only when usage is unusual even for the expected weekend pattern.

⚙️ Technical Explanation

The lab uses a baseline window plus current metric values. The detector compares current values against recent normal behavior and emits an enriched alert with deviation, likely impact, and supporting context.

📊 Visual Representation

Anomaly Lab Flow: Azure Metrics → Baseline Builder → Detector → Enriched Alert

⌨️ Commands / Syntax

kusto
InsightsMetrics
| where Namespace == "container.azm.ms/kubestate"
| summarize avg(Val) by bin(TimeGenerated, 5m), Name
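
python
def is_anomalous(current, baseline_mean, baseline_std):
    # Upper edge of the normal band: mean plus three standard deviations.
    upper = baseline_mean + (3 * baseline_std)
    # Returns (flag, upper bound); only upward deviations are flagged.
    return current > upper, upper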

🧪 Hands-on

  1. Choose a metric such as request latency, CPU, or queue length.
  2. Collect 7 days of historical values in 5-minute buckets.
  3. Compute a simple moving baseline and deviation band.
  4. Run the detector against current data.
  5. Create an alert payload that includes baseline, current value, and percentage deviation.
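
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not the lab's reference solution: the helper names `build_baseline` and `build_alert`, the payload field names, and the sample latency values are all assumptions made for the example. It uses a plain mean and sample standard deviation over the most recent window as the "simple moving baseline".

```python
from statistics import mean, stdev


def build_baseline(history, window):
    """Return (mean, std) of the most recent `window` values in `history`."""
    recent = history[-window:]
    return mean(recent), stdev(recent)


def build_alert(metric, current, baseline_mean, baseline_std, k=3.0):
    """Return an alert payload dict, or None when the value is inside the band."""
    upper = baseline_mean + k * baseline_std
    if current <= upper:
        return None
    # Percentage deviation from the baseline mean (assumes a non-zero mean).
    deviation_pct = (current - baseline_mean) / baseline_mean * 100
    return {
        "metric": metric,
        "current": current,
        "baseline_mean": round(baseline_mean, 2),
        "upper_bound": round(upper, 2),
        "deviation_pct": round(deviation_pct, 1),
    }


# Hypothetical 5-minute latency averages in milliseconds (made-up sample data).
history = [88, 92, 90, 91, 89, 93, 90, 87, 91, 89]
base_mean, base_std = build_baseline(history, window=10)
alert = build_alert("request_latency_ms", 280, base_mean, base_std)
print(alert)
```

In a real run, `history` would come from the 7 days of 5-minute buckets collected in step 2, and the window size and multiplier `k` would be tuned per metric.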

🧭 Example (Real-world Use Case)

An internal API usually sees low traffic overnight. At 02:30, latency rises from 90ms to 280ms. A static 500ms threshold never fires, but the anomaly model catches the abnormal jump and triggers early investigation.
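
The arithmetic behind this example can be checked with a few lines. The 90ms baseline, 280ms reading, and 500ms static threshold come from the scenario above; the 15ms standard deviation is an assumed value chosen only to make the comparison concrete.

```python
STATIC_THRESHOLD_MS = 500  # static threshold from the scenario
baseline_mean_ms = 90      # usual overnight latency from the scenario
baseline_std_ms = 15       # assumed spread; the lesson does not give one
current_ms = 280           # latency observed at 02:30

# Static rule: fires only above the fixed threshold.
static_fires = current_ms > STATIC_THRESHOLD_MS

# Baseline rule: fires above mean + 3 standard deviations.
upper_ms = baseline_mean_ms + 3 * baseline_std_ms
anomaly_fires = current_ms > upper_ms

# How many standard deviations above the baseline the reading sits.
sigmas = (current_ms - baseline_mean_ms) / baseline_std_ms

print(static_fires, anomaly_fires, upper_ms, round(sigmas, 1))
```

Under these assumptions the static rule stays silent while the baseline rule fires, because 280ms is far above the 135ms upper band even though it is well under 500ms.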

🛠️ Try It Yourself

🐛 Debugging Scenario

Problem: You get anomalies every morning when traffic ramps up.

🎯 Interview Questions

Beginner

What is a baseline in anomaly detection?

A baseline is the expected normal behavior of a metric over time.

Why are static thresholds limited?

They do not adapt to seasonality, traffic patterns, or service-specific behavior.

What is a simple way to detect anomalies?

Compare the current value against a historical mean and standard deviation or a dynamic band.

What does Azure Monitor provide in this lab?

It provides the telemetry source used to build baselines and evaluate current behavior.

Why enrich anomaly alerts?

Enrichment gives responders context such as deviation magnitude and recent changes so they can act faster.

Intermediate

How do you choose a detection window?

I choose a window that matches the service behavior and the response speed needed, balancing sensitivity and stability.

Why is seasonality important?

Because traffic and load often vary by hour, weekday, and business cycle, and ignoring that creates false positives.
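
One simple way to respect hourly seasonality is to keep a separate baseline per hour of day, so a morning ramp is compared with previous mornings rather than with the quiet night before. The sketch below assumes samples arrive as `(hour, value)` pairs; the function names and the sample numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_baselines(samples):
    """Group (hour, value) samples by hour and return {hour: (mean, std)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    # stdev needs at least two points, so skip hours with too little data.
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}


def is_anomalous_for_hour(hour, current, baselines, k=3.0):
    """Compare `current` with the band for its own hour of day."""
    base_mean, base_std = baselines[hour]
    return current > base_mean + k * base_std


# Hypothetical requests-per-minute samples: quiet at 03:00, busy at 09:00.
samples = [(3, 20), (3, 22), (3, 18), (3, 21),
           (9, 400), (9, 420), (9, 380), (9, 410)]
baselines = hourly_baselines(samples)

print(is_anomalous_for_hour(9, 415, baselines))  # normal for 09:00
print(is_anomalous_for_hour(3, 415, baselines))  # far outside the 03:00 band
```

The same value is treated as normal at 09:00 and anomalous at 03:00. Weekday and business-cycle effects could be handled the same way by widening the grouping key.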

What is the tradeoff between sensitivity and noise?

Higher sensitivity catches more true issues but also increases false positives, so calibration matters.

How do you validate anomaly models?

Backtest on historical incidents and known normal periods, then compare misses and false alarms.
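
A backtest of this kind can be as small as counting hits, misses, and false alarms over labeled history. The sketch below assumes the detector output and the incident labels are aligned boolean lists, one entry per time bucket; the `backtest` helper and the sample labels are made up for illustration.

```python
def backtest(flags, incidents):
    """Count hits, misses, and false alarms over aligned boolean lists.

    flags[i] is True when the detector fired in bucket i;
    incidents[i] is True when bucket i belongs to a known incident.
    """
    pairs = list(zip(flags, incidents))
    hits = sum(1 for f, i in pairs if f and i)
    misses = sum(1 for f, i in pairs if i and not f)
    false_alarms = sum(1 for f, i in pairs if f and not i)
    return {"hits": hits, "misses": misses, "false_alarms": false_alarms}


# Hypothetical labels for eight 5-minute buckets.
flags = [False, True, True, False, False, True, False, False]
incidents = [False, True, True, True, False, False, False, False]
result = backtest(flags, incidents)
print(result)
```

Comparing these counts across candidate thresholds or window sizes gives a concrete basis for calibration.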

Would you use one model for every service?

No. Different services have different traffic shapes, baselines, and business tolerance for noise.

Scenario-based

Your anomaly detector performs well for APIs but badly for batch jobs. Why?

Because batch jobs follow different execution patterns and need separate baselines and alert logic.

How would you reduce false positives without missing real incidents?

I would add contextual features, minimum duration, and multimetric confirmation rather than simply raising thresholds.
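
Minimum duration and multimetric confirmation can be sketched as two small checks layered on top of per-bucket anomaly flags. The choice of latency and error rate as the two metrics, the three-bucket minimum, and the function names are assumptions for this example.

```python
def sustained(flags, min_consecutive):
    """True when the most recent `min_consecutive` flags are all True.

    Assumes min_consecutive >= 1.
    """
    if len(flags) < min_consecutive:
        return False
    return all(flags[-min_consecutive:])


def should_alert(latency_flags, error_flags, min_consecutive=3):
    """Alert only when both metrics stayed anomalous for the minimum duration."""
    return (sustained(latency_flags, min_consecutive)
            and sustained(error_flags, min_consecutive))


# Both metrics anomalous for the last three buckets: alert.
print(should_alert([False, True, True, True], [True, True, True, True]))

# Latency recovered for one bucket: treated as a blip, no alert.
print(should_alert([True, True, False, True], [True, True, True, True]))
```

This keeps the per-metric sensitivity unchanged and suppresses short, isolated spikes instead of raising the threshold itself.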

An anomaly fires, but users are unaffected. What do you do?

I treat it as calibration feedback and inspect whether the model flagged an interesting but non-actionable pattern.

Would you feed deployment events into the detector?

Yes. Deployment timing often explains behavior shifts and helps the model separate expected changes from suspicious ones.

How do you move this lab toward production?

Start with shadow alerts, tune against history, then phase into advisory and finally paging once the signal is trustworthy.

📝 Summary

This lab turns anomaly detection into a concrete monitoring pattern: baseline, compare, enrich, and alert with context rather than raw threshold crossings.