Advanced · Lesson 7 of 11

Alerting

Build alerts that matter using Prometheus rules and Alertmanager routing, while avoiding alert fatigue.

Simple Explanation (ELI5)

Prometheus watches your metrics and decides when something crosses a dangerous line. Alertmanager then decides who to notify, how often, and through which channel. One system detects the problem. The other handles communication.

Real-world Analogy

A smoke detector senses smoke, but the building control system decides which alarms to sound, which doors to unlock, and which responders to notify. Prometheus is the detector. Alertmanager is the response coordinator.

Technical Explanation

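Prometheus evaluates alerting rules, which are PromQL expressions, at a fixed evaluation interval. When an expression starts returning results, the alert becomes pending. Once the condition has stayed true for the duration set by for, the alert fires and Prometheus sends it to Alertmanager. Alertmanager groups related alerts, deduplicates them, supports silences, and routes notifications based on labels like severity, team, and environment.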

Element | Purpose | Example
alert rule | Condition definition | CPU above 90% for 10 minutes
for | Minimum time the condition must hold before the alert fires | Avoid transient spikes
labels | Routing metadata | severity=critical, team=platform
annotations | Human-readable context | Summary and runbook URL
Alertmanager route | Notification policy | Critical → PagerDuty, warning → Slack
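
To make the split between the two systems concrete, here is a minimal sketch of the Prometheus side of the wiring. The interval value, the rule file path, and the Alertmanager address are assumptions for illustration, not values from this lesson.

```yaml
# prometheus.yml (sketch): connects rule evaluation to Alertmanager
global:
  evaluation_interval: 30s        # how often alerting rules are evaluated (assumed value)

rule_files:
  - "rules/*.yml"                 # hypothetical location of alerting rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # hypothetical Alertmanager host and port
```

Prometheus only detects and forwards alerts here. Everything about who gets notified lives in the Alertmanager configuration.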

Visual Representation

PromQL Rule → Prometheus Fires Alert → Alertmanager Routes / Groups / Silences

Commands / Syntax

yaml
groups:
- name: cpu-alerts
  rules:
  - alert: HighCPUUsage
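    # CPU utilization per instance = 1 - average idle fraction across cores.
    # Averaging the non-idle modes directly would average across modes and never approach 0.9.
    expr: (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9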
    for: 10m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "High CPU on {{ $labels.instance }}"
      description: "CPU usage has been above 90% for 10 minutes."

yaml
route:
  receiver: slack-default
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - team="platform"
      receiver: slack-platform
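
The route above refers to three receivers by name, and Alertmanager will not load the configuration unless each one is defined. A minimal sketch of those definitions follows. The webhook URL, channel names, and routing key are placeholders, not real values.

```yaml
# alertmanager.yml (sketch): receiver definitions for the route tree above
receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE_ME"   # placeholder webhook
        channel: "#alerts"                                       # hypothetical channel
  - name: slack-platform
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE_ME"   # placeholder webhook
        channel: "#platform-alerts"                              # hypothetical channel
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "REPLACE_ME"                                # placeholder integration key
```

Child routes are checked in order and the first match wins unless continue: true is set, so a critical alert from the platform team goes to pagerduty only.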

Example (Real-world Use Case)

A production Kubernetes cluster pages the platform on-call only when API error rate or node memory pressure threatens user experience. Lower-severity alerts, like a warning threshold on CPU trend, go to Slack. Each alert includes a dashboard link and runbook URL.
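
A paging rule for this scenario might look like the sketch below. The metric http_requests_total, its status label, the 5% threshold, and both URLs are assumptions for illustration. Substitute whatever your services actually export.

```yaml
groups:
  - name: api-availability
    rules:
      - alert: HighApiErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API 5xx error ratio above 5%"
          dashboard: "https://grafana.example.com/d/api-overview"            # hypothetical link
          runbook_url: "https://runbooks.example.com/high-api-error-rate"    # hypothetical link
```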

Hands-on Section

  1. Create a high CPU alert with a for: 10m guard.
  2. Add annotations containing summary, description, and a runbook URL.
  3. Route critical alerts to one receiver and warnings to another.
  4. Silence one alert temporarily and observe Alertmanager behavior.
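
Before deploying the rule from step 1, you can check it offline with a promtool unit test. The sketch below assumes the rule is saved as cpu-alerts.yml (a hypothetical file name) and feeds in synthetic counters for a single core that is about 95% busy.

```yaml
# cpu-alerts-test.yml (sketch): run with `promtool test rules cpu-alerts-test.yml`
rule_files:
  - cpu-alerts.yml                # hypothetical file containing the HighCPUUsage rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # user mode gains 57 CPU-seconds per minute, idle gains 3
      - series: 'node_cpu_seconds_total{instance="node1", cpu="0", mode="user"}'
        values: '0+57x20'
      - series: 'node_cpu_seconds_total{instance="node1", cpu="0", mode="idle"}'
        values: '0+3x20'
    alert_rule_test:
      - eval_time: 15m            # past the 10m "for" window
        alertname: HighCPUUsage
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              instance: node1
            exp_annotations:
              summary: "High CPU on node1"
              description: "CPU usage has been above 90% for 10 minutes."
```

Changing eval_time to a value inside the for window, with an empty exp_alerts list, is a simple way to confirm that the guard holds the alert in pending.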

Interview Questions

Beginner

What is the difference between Prometheus alerting rules and Alertmanager?

Prometheus decides when an alert condition is true. Alertmanager handles delivery, grouping, silencing, and routing.

Why use the for clause in alerts?

It prevents alerts from firing on short-lived spikes and reduces noise.

What are annotations in an alert?

Annotations are human-readable fields like summaries, descriptions, and runbook links.

What is a silence?

A silence temporarily suppresses matching alerts in Alertmanager.

Why are labels useful in alerts?

They provide routing context such as severity, environment, cluster, or owning team.

Intermediate

What makes an alert actionable?

An alert is actionable when it reflects real impact and includes enough context to start response immediately.

How do you reduce alert fatigue?

Remove low-value alerts, add for clauses, group notifications, route correctly, and focus on symptoms that matter to users.

Why might you alert on error budget burn instead of raw CPU?

Error budget burn ties alerts to customer impact and service objectives, while raw CPU can be noisy and not always harmful.
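
As an illustration, a fast-burn rule for a 99.9% availability objective could be sketched as follows. The metric name and status label are assumptions. The factor 14.4 is the burn rate that would consume 2% of a 30-day error budget in one hour (0.02 × 720 hours).

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        # Both the long and the short window must exceed the burn threshold,
        # so the alert fires quickly and also stops quickly once errors end.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning 14.4x faster than the SLO allows"
```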

How should alerts differ between staging and production?

Production alerts should page only for meaningful impact. Staging alerts are usually informational and routed to chat, not pagers.
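
One way to express this in Alertmanager is to match on an environment label, as in the sketch below. It assumes every alert carries an environment label (for example, set through external_labels in each Prometheus) and that a slack-staging receiver exists. Both are assumptions.

```yaml
route:
  receiver: slack-default
  routes:
    # Page only for critical alerts from production
    - matchers:
        - environment="production"
        - severity="critical"
      receiver: pagerduty
    # Everything from staging goes to chat
    - matchers:
        - environment="staging"
      receiver: slack-staging     # hypothetical receiver name
```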

What is alert deduplication?

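Deduplication collapses identical alerts, such as the same alert sent by several Prometheus replicas or re-sent on every evaluation, into a single notification so responders are not spammed.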

Scenario-based

CPU alerts fire every deployment. What would you change?

I would confirm whether rollouts naturally spike CPU, then add or lengthen the for clause, change thresholds, or shift to a user-impact signal like latency or error rate.

A single outage generates 500 pages. Where is the problem?

Alertmanager grouping or deduplication is likely misconfigured, or alerts are labeled too granularly.
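
A sketch of grouping and inhibition settings that would cut this down is shown below. The cluster label, the timing values, and the ClusterDown alert name are assumptions for illustration.

```yaml
route:
  receiver: slack-default
  group_by: ["alertname", "cluster"]   # one notification per alert type per cluster
  group_wait: 30s                      # wait briefly so related alerts arrive together
  group_interval: 5m                   # minimum gap between updates for an existing group
  repeat_interval: 4h                  # how often to re-send while the group is still firing

inhibit_rules:
  # While a cluster-wide outage alert is firing, mute warnings from the same cluster
  - source_matchers:
      - alertname="ClusterDown"        # hypothetical alert name
    target_matchers:
      - severity="warning"
    equal: ["cluster"]
```

Grouping by a high-cardinality label such as pod or instance would recreate the problem, because every distinct value forms its own group.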

How would you alert on Kubernetes memory pressure at the node level?

I would alert on node memory availability and eviction-related signals, not only per-pod memory spikes, because node pressure impacts multiple workloads.
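
Both signals could be expressed as rules along these lines. The sketch assumes node_exporter and kube-state-metrics are being scraped, and the thresholds and durations are illustrative only.

```yaml
groups:
  - name: node-memory
    rules:
      - alert: NodeMemoryLow
        # Less than 10% of memory available on the node (node_exporter metrics)
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"
      - alert: NodeMemoryPressure
        # Kubelet has set the MemoryPressure condition (kube-state-metrics metric)
        expr: kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubelet reports MemoryPressure on {{ $labels.node }}"
```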

A critical alert fires but nobody gets notified. What do you inspect?

I inspect whether Prometheus fired the alert, whether labels matched Alertmanager routes, receiver health, and whether a silence suppressed it.

Would you route every production alert to PagerDuty?

No. Only urgent, user-impacting, actionable alerts should page. Everything else should go to lower-noise channels.

Summary

Good alerting is not about more alerts. It is about fewer, sharper alerts tied to real service health. Prometheus defines conditions. Alertmanager makes sure the right humans hear about the right problem at the right time.