Advanced · Lesson 10 of 16

AI-Powered Monitoring with Prometheus and Azure Monitor

Extend standard monitoring stacks with anomaly detection, forecasting, and AI-based alert enrichment.

🧒 Simple Explanation (ELI5)

Prometheus and Azure Monitor tell you what is happening. AI helps tell you whether it is unusual, why it matters, and what to check next.

🤔 Why Do We Need It?

Threshold-only monitoring misses slow drifts and unusual patterns.
Metrics alone do not explain incident context clearly.
Different services have different normal behavior by time of day or release cycle.
Forecasting helps teams act before saturation or failure.

🌍 Real-world Analogy

A speedometer tells you current speed. A smart driving assistant compares that speed with road, weather, and traffic conditions and tells you whether you are actually at risk.

⚙️ Technical Explanation

AI-powered monitoring typically layers on top of an existing metric and alert stack. Prometheus handles scraping, storing, and querying metrics. Azure Monitor adds cloud telemetry, dashboards, and alerting. The AI layer adds dynamic thresholds, multivariate anomaly detection, time-series forecasting, event correlation, and natural-language summaries.

📊 Visual: AI Monitoring Pipeline in AKS

Input → AI → Action: Prometheus + Azure Monitor in AKS

🐳 AKS Pods (CPU / Mem / Latency) → 📊 Prometheus Scrape (PromQL feature extraction) → 🤖 AI Anomaly Model (z-score + time-series) → ☁️ Azure Monitor (Smart Detection + Forecast) → 🚨 Smart Alert (enriched + context-rich)

⌨️ Commands and Queries

promql

# ── Z-score feature: CPU deviation from 1-hour rolling baseline (AI anomaly detection input) ──
# Numerator and denominator are both in CPU cores, so the result is a unitless z-score.
# (Scaling only the numerator by 100 would inflate the z-score 100×.)
(
  rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])
  - avg_over_time(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])[1h:5m])
) / (stddev_over_time(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])[1h:5m]) + 0.001)

# ── Alert: P99 latency > 2× its own 6-hour rolling average (dynamic threshold) ──
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="checkout-api"}[5m]))
  > 2 * avg_over_time(
      histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{job="checkout-api"}[5m]))[6h:5m]
    )

# ── Capacity forecast: predicted free bytes 7 days ahead (predict_linear) ──
# Matches filesystems whose 48h trend reaches zero free space within 7 days
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[48h], 7 * 24 * 3600) < 0

kusto

// ── Azure Monitor KQL: Z-score CPU anomaly per container instance in AKS ──
// Counter names depend on the agent writing to Perf; verify ObjectName and CounterName
// in your workspace (Container insights typically uses K8SContainer / cpuUsageNanoCores).
let baseline_days = 7d;
let current_window = 5m;
Perf
| where ObjectName == "Container" and CounterName == "% Processor Time Usage"
| where TimeGenerated > ago(baseline_days)
// Group by InstanceName as well as Computer so the baseline is per container, not per node
| summarize mean_cpu = avg(CounterValue), std_cpu = stdev(CounterValue) by Computer, InstanceName
| join kind=inner (
    Perf
    | where ObjectName == "Container" and CounterName == "% Processor Time Usage"
    | where TimeGenerated > ago(current_window)
    | summarize current_cpu = avg(CounterValue) by Computer, InstanceName
  ) on Computer, InstanceName
| extend z_score = (current_cpu - mean_cpu) / (std_cpu + 0.001)
| where z_score > 2.5
| project Computer, InstanceName, current_cpu, mean_cpu, z_score
| order by z_score desc

bash

# ── Kubernetes: Get raw resource usage (AI model input) ──
kubectl top pods -n prod --sort-by=cpu
kubectl top nodes

# ── Azure CLI: Query metric and pipe to anomaly script ──
az monitor metrics list \
  --resource "/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.ContainerService/managedClusters/$AKS" \
  --metric "node_cpu_usage_percentage" \
  --interval PT5M --output json \
  | python3 anomaly_flag.py
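
The lesson pipes the CLI output into anomaly_flag.py but does not show the script. The following is a minimal sketch of what such a script could look like, not the course's actual implementation. It assumes the JSON has the shape value[].timeseries[].data[] with timeStamp and average fields, and it reuses the 2.5 z-score cutoff from the KQL query; the function names are hypothetical.

```python
import json
import statistics
import sys


def extract_points(payload):
    """Flatten (timestamp, average) pairs from the assumed metrics JSON shape."""
    points = []
    for metric in payload.get("value", []):
        for series in metric.get("timeseries", []):
            for sample in series.get("data", []):
                if sample.get("average") is not None:
                    points.append((sample["timeStamp"], float(sample["average"])))
    return points


def flag_anomalies(points, threshold=2.5):
    """Return points whose z-score against the whole window exceeds the threshold."""
    values = [v for _, v in points]
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    flagged = []
    for ts, v in points:
        z = (v - mean) / (std + 0.001)  # same epsilon guard as the queries above
        if abs(z) > threshold:
            flagged.append({"timeStamp": ts, "value": v, "z_score": round(z, 2)})
    return flagged


def main(stream=sys.stdin):
    """Read metrics JSON from a stream and print one JSON line per flagged point."""
    for row in flag_anomalies(extract_points(json.load(stream))):
        print(json.dumps(row))


# Demo with a synthetic payload: twelve 5-minute samples, one of them a spike.
sample = {"value": [{"timeseries": [{"data": [
    {"timeStamp": f"2024-05-01T10:{m:02d}:00Z", "average": 95.0 if m == 55 else 40.0}
    for m in range(0, 60, 5)
]}]}]}
flagged = flag_anomalies(extract_points(sample))
```

To use it in the pipeline above, call main() under an `if __name__ == "__main__":` guard at the bottom of the file. Scoring each point against a window that includes itself is the simplest option; a production detector would more likely score against a separate historical baseline.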

🧪 Hands-on

Select one metric with strong daily seasonality, such as request rate or CPU usage.
Build a baseline using the last 7-14 days.
Send current values plus baseline deltas into an anomaly detector.
Write the AI output back into an enriched alert payload.
Validate alerts against known incidents and known normal traffic spikes.
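
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the history is synthetic, the baseline is a plain hour-of-day mean and standard deviation, and the function names and payload fields are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import statistics


def build_hourly_baseline(history):
    """Group historical (timestamp, value) samples by hour of day -> (mean, std)."""
    buckets = defaultdict(list)
    for ts, value in history:
        buckets[ts.hour].append(value)
    return {
        hour: (statistics.fmean(vals), statistics.pstdev(vals))
        for hour, vals in buckets.items()
    }


def enrich_alert(metric, ts, value, baseline, z_threshold=2.5):
    """Score one current sample against its hour-of-day baseline and build a payload."""
    mean, std = baseline[ts.hour]
    z = (value - mean) / (std + 0.001)
    return {
        "metric": metric,
        "timestamp": ts.isoformat(),
        "value": value,
        "baseline_mean": round(mean, 2),
        "delta_pct": round(100 * (value - mean) / mean, 1) if mean else None,
        "z_score": round(z, 2),
        "anomalous": abs(z) > z_threshold,
    }


# Synthetic 7-day history: about 100 req/s at night, about 500 req/s from 09:00
# to 17:59, with a +/-5 wobble per day so the standard deviation is not zero.
start = datetime(2024, 1, 1)
history = []
for h in range(7 * 24):
    ts = start + timedelta(hours=h)
    base = 500.0 if 9 <= ts.hour < 18 else 100.0
    history.append((ts, base + (5.0 if (h // 24) % 2 else -5.0)))

baseline = build_hourly_baseline(history)
# The same raw value is ordinary at 10:00 but far outside the 03:00 baseline.
day_alert = enrich_alert("request_rate", datetime(2024, 1, 8, 10), 510.0, baseline)
night_alert = enrich_alert("request_rate", datetime(2024, 1, 8, 3), 510.0, baseline)
```

Including baseline_mean and delta_pct in the payload gives the responder the evidence behind the flag, which is the point of the enrichment step.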

🧭 Example (Real-world Use Case)

A streaming platform uses Prometheus for application metrics and Azure Monitor for cloud metrics. AI compares current ingestion lag with historical weekday patterns and marks a 15% lag increase as normal during peak hours but flags a 4% increase at 3 a.m. as suspicious.

🛠️ Try It Yourself

🎮 Challenge: Build and Tune a Dynamic Threshold Alert in Prometheus

Run the z-score PromQL query above in your Prometheus UI or Grafana Explore. Replace namespace="prod" with any namespace you have. Are there moments where the z-score moves beyond ±2.5? What was happening at those times?
Set the baseline window: Change [1h:5m] to [7d:5m] in the baseline formula. Compare the alert sensitivity. Which catches genuine incidents? Which produces fewer false positives on Monday mornings?
Build a capacity forecast alert: Run predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[48h], 7 * 24 * 3600) < 0 in Prometheus. Identify which nodes (if any) are projected to run out of disk space within 7 days.
KQL exercise: Run the z-score KQL query in your Azure Log Analytics workspace. Find the top 3 results by z-score right now. Are they genuinely anomalous, or running an expected batch job?
Alert fatigue audit: List your 5 noisiest current alerts. For each, write one sentence on what dynamic threshold logic would suppress the false positives. Share one of them with your team as a candidate for conversion.

🐛 Debugging Scenarios

False Positive: AI Fires P2 Alert Every Monday Morning

Signal: AlertManager fires "CPU anomaly" every Monday 08:00–09:30 UTC. On-call engineers acknowledge and immediately ignore it. Monday is now a "cry wolf" shift.

Root cause: The PromQL baseline window is 1 hour (avg_over_time(...[1h:5m])). Sunday night and early Monday are quiet; Monday morning is peak traffic. A baseline built from the quiet period just before the ramp makes Monday morning look like a 4× anomaly even though it is entirely predictable.
Fix: Expand the baseline to avg_over_time(...[7d:5m]) so it spans a full weekly cycle, or compare against the same time last week with offset 7d for a true same-weekday comparison. In KQL, add | where dayofweek(TimeGenerated) == dayofweek(now()) to compute a day-specific baseline, and widen baseline_days (for example to 28d) so that several matching weekdays are included. Alternatively, use a seasonality-aware model such as Prophet, or Azure Monitor dynamic-threshold alerts, which learn weekly patterns.
Verification: Backtest the new query against 4 weeks of data. The query should return 0 matches for Monday 08:00–09:30, and should still match last Tuesday's real DB spike.
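
The backtest can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the production query: traffic levels are invented, samples are hourly, and both detectors use a simple "more than 2× baseline" rule instead of a z-score. It compares a trailing baseline with a same-weekday-and-hour baseline over the final week, with one injected incident on Tuesday at 14:00.

```python
import statistics

HOURS_PER_WEEK = 7 * 24


def synthetic_traffic(hour_index):
    """Hypothetical load: busy weekday daytime, moderate weeknights, quiet weekends."""
    day, hour = divmod(hour_index % HOURS_PER_WEEK, 24)  # day 0 = Monday
    if day >= 5 or (day == 0 and hour < 8):
        return 20.0  # weekend, including the quiet tail into Monday morning
    return 80.0 if 8 <= hour < 18 else 50.0


def trailing_ratio(series, i, window=6):
    """Current value divided by the mean of the previous `window` samples."""
    return series[i] / statistics.fmean(series[i - window:i])


def same_slot_ratio(series, i, weeks=3):
    """Current value divided by the mean of the same weekday and hour in prior weeks."""
    reference = [series[i - k * HOURS_PER_WEEK] for k in range(1, weeks + 1)]
    return series[i] / statistics.fmean(reference)


series = [synthetic_traffic(i) for i in range(4 * HOURS_PER_WEEK)]
spike_index = 3 * HOURS_PER_WEEK + 24 + 14  # Tuesday 14:00 of the final week
series[spike_index] = 240.0                 # injected "real" incident

last_week = range(3 * HOURS_PER_WEEK, 4 * HOURS_PER_WEEK)
trailing_hits = [i for i in last_week if trailing_ratio(series, i) > 2.0]
same_slot_hits = [i for i in last_week if same_slot_ratio(series, i) > 2.0]
```

On this data the trailing baseline flags the Monday 08:00 and 09:00 ramp as well as the injected spike, while the same-slot baseline flags only the spike. That matches the pass criteria described in the verification step.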

Wrong Prediction: Disk Capacity Alert Fires But Disk Is Stable

Signal: The predict_linear alert fires "disk full in 18 hours." Engineers investigate and find the disk at 65%, stable for 3 weeks.

Root cause: The rule ran predict_linear with a 6-hour lookback, and predict_linear fits a simple linear regression over whatever window it is given. A log rotation task ran at 22:00 last night, causing a 20-minute disk write spike. The regression extrapolated the spike as a permanent growth trend.
Fix: Extend the lookback to 48h or more: predict_linear(...[48h], 7 * 24 * 3600). Only alert when free space declines consistently over the extended window. Add deriv(node_filesystem_avail_bytes[6h]) < -threshold as a secondary confirmation signal.
Verification: Replay the rule against 30 days of data. Log rotation nights should not trigger it. A node with genuine 2% daily growth should trigger it 7 days before projected full.
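
To see why the lookback matters, here is a simplified stand-in for predict_linear written as an ordinary least-squares fit. It is a sketch on invented numbers, not Prometheus's exact implementation: free space sits at 35 GB for 48 hours, sampled every 10 minutes, and dips over the last three samples to mimic a short write burst.

```python
def predict_linear(samples, horizon_s):
    """Fit a least-squares line through (t_seconds, value) samples and
    extrapolate it horizon_s seconds past the last sample."""
    n = len(samples)
    t_mean = sum(t for t, _ in samples) / n
    v_mean = sum(v for _, v in samples) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    var = sum((t - t_mean) ** 2 for t, _ in samples)
    slope = cov / var
    t_last = samples[-1][0]
    return v_mean + slope * (t_last - t_mean + horizon_s)


STEP_S = 600  # one sample every 10 minutes (assumed scrape cadence for this sketch)
GB = 1e9

# 48 hours of stable free space, then a brief dip in the final three samples.
avail = [(i * STEP_S, 35 * GB) for i in range(289)]
for i, value in zip((286, 287, 288), (33, 31, 29)):
    avail[i] = (i * STEP_S, value * GB)

week_s = 7 * 24 * 3600
short_window = predict_linear(avail[-37:], week_s)  # 6h lookback
long_window = predict_linear(avail, week_s)         # 48h lookback
```

With the 6-hour window the three low samples dominate the slope and the 7-day prediction goes negative, so the alert would fire. With the 48-hour window the same dip barely moves the fitted line and the prediction stays well above zero.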

🎯 Interview Questions

Beginner

What is the difference between regular monitoring and AI-powered monitoring?

Regular monitoring collects and visualizes signals. AI-powered monitoring adds interpretation such as anomaly detection, forecasting, and contextual prioritization.

Why are dynamic thresholds useful?

They adapt to changing behavior over time instead of treating every traffic pattern as if it should follow one fixed number.

What does Prometheus do in this design?

Prometheus scrapes and stores metrics and provides query capabilities that can feed AI analysis.

What does Azure Monitor add?

Azure Monitor adds cloud telemetry, dashboards, Kusto queries, and native alerting and visualization capabilities.

What is forecasting in monitoring?

Forecasting predicts likely future values, such as storage usage or queue depth, so teams can act before failure occurs.

Intermediate

Why is multivariate detection often better than one metric only?

Because incidents often show up across several signals together, like latency, error rate, and queue depth, rather than in one isolated metric.

How would you avoid false positives in AI monitoring?

I would model seasonality, sanitize noisy metrics, use minimum anomaly duration rules, and validate against historical incident windows.
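
A minimum anomaly duration rule is straightforward to express in code. The sketch below is illustrative only, with a hypothetical function name; it keeps an anomaly only when it persists for several consecutive evaluations, which is similar in spirit to the `for:` clause on a Prometheus alerting rule.

```python
def sustained_anomalies(flags, min_consecutive=3):
    """Return the start index of each run with at least min_consecutive True flags."""
    starts, run = [], 0
    for i, flagged in enumerate(flags):
        run = run + 1 if flagged else 0
        if run == min_consecutive:
            # Report each qualifying run once, at the index where it began.
            starts.append(i - min_consecutive + 1)
    return starts


# One isolated blip, one sustained run of four, and a run of two at the end.
flags = [False, True, False, False, True, True, True, True, False, True, True]
starts = sustained_anomalies(flags)
```

Only the run of four beginning at index 4 qualifies; the single blip and the trailing pair are suppressed.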

What role does feature engineering play here?

Feature engineering turns raw telemetry into useful context such as deviation from baseline, rate of change, or deployment proximity.

How do you correlate Prometheus metrics with Azure Monitor signals?

Use shared dimensions like service name, cluster, namespace, region, or instance identity and normalize timestamps before analysis.
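
As a rough sketch of that idea, the code below joins two sample lists on a shared service name and a 5-minute UTC time bucket. The record layout (service, time, value keys) is a hypothetical simplification; real Prometheus and Azure Monitor exports would need their own label and dimension mapping first.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket(ts_iso, width_s=300):
    """Parse an ISO-8601 timestamp and floor it to a UTC bucket (epoch seconds)."""
    dt = datetime.fromisoformat(ts_iso.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive times are UTC
    epoch = int(dt.timestamp())
    return epoch - epoch % width_s


def correlate(prom_samples, azure_samples, width_s=300):
    """Pair samples that share a service name and fall in the same time bucket."""
    index = defaultdict(list)
    for a in azure_samples:
        index[(a["service"], bucket(a["time"], width_s))].append(a)
    pairs = []
    for p in prom_samples:
        for a in index.get((p["service"], bucket(p["time"], width_s)), []):
            pairs.append((p, a))
    return pairs


prom = [{"service": "checkout-api", "time": "2024-05-01T10:02:10+00:00", "value": 0.93}]
azure = [
    # Same instant range as the Prometheus sample, expressed in a +02:00 offset.
    {"service": "checkout-api", "time": "2024-05-01T12:04:30+02:00", "value": 71.0},
    {"service": "search-api", "time": "2024-05-01T10:03:00Z", "value": 12.0},
    {"service": "checkout-api", "time": "2024-05-01T10:07:00Z", "value": 69.0},
]
pairs = correlate(prom, azure)
```

Only the first Azure record pairs with the Prometheus sample: the second belongs to a different service and the third falls into the next 5-minute bucket.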

Why is explainability important for smart alerts?

Responders need to know why the alert fired, which baseline was violated, and what evidence supports the anomaly conclusion.

Scenario-based

Your AI monitoring works well in staging but poorly in production. What do you inspect?

I would inspect workload diversity, data volume differences, seasonality, and whether production has business-driven patterns missing from staging.

How would you use AI to improve a noisy CPU alert in AKS?

I would compare CPU with request rate, pod scaling, and historical baselines to distinguish healthy scaling events from abnormal CPU pressure.

A forecast says disk will fill in 18 hours, but operators do not trust it. How do you improve adoption?

Show forecast history accuracy, confidence bands, and links to the underlying metric trend instead of just surfacing a raw prediction.

Would you replace all classic alerts with AI alerts?

No. Hard safety thresholds still matter for deterministic failure conditions. AI is strongest as an augmentation layer, not a total replacement.

How do you handle alert disagreement between static and AI thresholds?

I would inspect the underlying data, compare confidence, and use disagreement as a calibration signal during rollout rather than forcing one source to win blindly.

📝 Summary

AI-powered monitoring works best when it strengthens the telemetry you already trust. Prometheus and Azure Monitor remain the data foundation; AI adds adaptive interpretation, forecasting, and better operational context.
