Alerting
Build alerts that matter using Prometheus rules and Alertmanager routing, while avoiding alert fatigue.
Simple Explanation (ELI5)
Prometheus watches your metrics and decides when something crosses a dangerous line. Alertmanager then decides who to notify, how often, and through which channel. One system detects the problem. The other handles communication.
Real-world Analogy
A smoke detector senses smoke, but the building control system decides which alarms to sound, which doors to unlock, and which responders to notify. Prometheus is the detector. Alertmanager is the response coordinator.
Technical Explanation
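Prometheus evaluates alerting rules, which are PromQL expressions, at a regular evaluation interval. When an expression returns results, the alert becomes pending. Once the condition has held continuously for the duration set by for, the alert fires and Prometheus sends it to Alertmanager. Alertmanager groups related alerts, deduplicates them, supports silences, and routes notifications based on labels such as severity, team, and environment.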
| Element | Purpose | Example |
|---|---|---|
| alert rule | Condition definition | CPU above 90% for 10 minutes |
| for | How long the condition must hold before the alert fires | Ignore transient spikes |
| labels | Routing metadata | severity=critical, team=platform |
| annotations | Human-readable context | Summary and runbook URL |
| Alertmanager route | Notification policy | Critical → PagerDuty, warning → Slack |
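
The grouping and routing behavior described above is controlled on the Alertmanager route. The following is a minimal sketch, not a recommended production config: the `cluster` label, the timing values, the channel name, and the placeholder credentials are all assumptions made for illustration.

```yaml
# alertmanager.yml (sketch): grouping and repeat timing on the top-level route.
route:
  receiver: slack-default          # fallback when no child route matches
  group_by: [alertname, cluster]   # alerts sharing these labels are sent as one notification
  group_wait: 30s                  # wait before the first notification for a new group
  group_interval: 5m               # wait before notifying about alerts added to an existing group
  repeat_interval: 4h              # re-send interval while the group keeps firing
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
      repeat_interval: 1h          # assumed: remind more often for pages

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook
        channel: "#alerts"                                      # hypothetical channel
  - name: pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME                                 # placeholder key
```

Grouping by fewer labels produces fewer, larger notifications. Grouping by more labels produces more, smaller ones, which is often where notification noise comes from.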
Commands / Syntax
# Prometheus rule file
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCPUUsage
        # CPU usage is 1 minus the idle fraction, averaged across the cores of each instance.
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
        for: 10m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage has been above 90% for 10 minutes."

# Alertmanager routing
route:
  receiver: slack-default
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - team="platform"
      receiver: slack-platform
Example (Real-world Use Case)
A production Kubernetes cluster pages the platform on-call only when API error rate or node memory pressure threatens user experience. Lower-severity alerts, like a warning threshold on CPU trend, go to Slack. Each alert includes a dashboard link and runbook URL.
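
One way such a setup might look as a rule file is sketched below. It assumes the cluster exposes `apiserver_request_total` (Kubernetes API server) and `node_cpu_seconds_total` (node_exporter). The alert names, thresholds, durations, and the example.com URLs are placeholders, not values taken from a real deployment.

```yaml
groups:
  - name: user-impact-alerts
    rules:
      # Pages on-call: the share of API requests answered with a 5xx status.
      - alert: APIServerHighErrorRate
        expr: |
          sum(rate(apiserver_request_total{code=~"5.."}[5m]))
            /
          sum(rate(apiserver_request_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API server 5xx ratio above 5%"
          dashboard: "https://grafana.example.com/d/apiserver"           # hypothetical URL
          runbook_url: "https://runbooks.example.com/apiserver-errors"   # hypothetical URL

      # Goes to Slack only: sustained but not yet urgent CPU load.
      - alert: NodeCPUTrendWarning
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
        for: 30m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "CPU above 80% for 30 minutes on {{ $labels.instance }}"
          dashboard: "https://grafana.example.com/d/nodes"               # hypothetical URL
          runbook_url: "https://runbooks.example.com/node-cpu"           # hypothetical URL
```

The severity label is what lets Alertmanager send the first alert to a pager and the second to chat.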
Hands-on Section
- Create a high CPU alert with a for: 10m guard.
- Add annotations containing summary, description, and a runbook URL.
- Route critical alerts to one receiver and warnings to another.
- Silence one alert temporarily and observe Alertmanager behavior.
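
To check the for: 10m guard without waiting for real load, the rule can be exercised with a promtool unit test. This sketch assumes the HighCPUUsage rule from the Commands / Syntax section is saved as `cpu-alerts.yml` (a hypothetical file name). The input series are synthetic: one core that is about 95% busy in user mode and 5% idle.

```yaml
# cpu-alerts-test.yml (hypothetical name); run with: promtool test rules cpu-alerts-test.yml
rule_files:
  - cpu-alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Counters grow by 57s and 3s per minute: roughly 95% user time, 5% idle.
      - series: 'node_cpu_seconds_total{instance="node-1", cpu="0", mode="user"}'
        values: '0+57x20'
      - series: 'node_cpu_seconds_total{instance="node-1", cpu="0", mode="idle"}'
        values: '0+3x20'
    alert_rule_test:
      # Early on, the condition has not yet held for 10 minutes, so nothing should fire.
      - eval_time: 5m
        alertname: HighCPUUsage
        exp_alerts: []
      # After 20 minutes of sustained load, the alert should be firing.
      - eval_time: 20m
        alertname: HighCPUUsage
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              instance: node-1
            exp_annotations:
              summary: "High CPU on node-1"
              description: "CPU usage has been above 90% for 10 minutes."
```

Shortening the series or the second eval_time is a quick way to see how the for clause delays firing.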
Try It Yourself
- Write an alert for pod restarts over the last 15 minutes.
- Design a warning and critical version of a memory alert.
- List two ways to reduce alert noise without hiding real incidents.
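
A possible starting point for the first two exercises is sketched here. It assumes kube-state-metrics (for `kube_pod_container_status_restarts_total`) and node_exporter (for the `node_memory_*` metrics) are being scraped. The alert names, thresholds, and durations are illustrative assumptions to tune, not recommended values.

```yaml
groups:
  - name: try-it-yourself-sketch
    rules:
      # Restart count per container over the last 15 minutes.
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"

      # Warning tier: looser threshold, longer guard, routed to chat.
      - alert: NodeMemoryLowWarning
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.20
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 20% memory available on {{ $labels.instance }}"

      # Critical tier: tighter threshold, shorter guard, routed to a pager.
      - alert: NodeMemoryLowCritical
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"
```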
Debugging Scenarios
- If an alert never fires, confirm the PromQL expression returns results and the for window is not longer than the spikes you expect to catch.
- If alerts fire too often, examine label dimensions and grouping policy in Alertmanager.
- If Kubernetes alerts page every rollout, add suppression logic for draining nodes or expected restarts.
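
For the rollout case, one option is an Alertmanager inhibit rule, sketched below as a fragment to merge into the Alertmanager config. The alert names NodeDraining and PodRestartingFrequently are hypothetical, and the sketch assumes both alerts carry a `node` label. If they do not, the `equal` list needs a label they do share.

```yaml
# Fragment of alertmanager.yml (sketch).
inhibit_rules:
  # While a drain alert is firing for a node, mute restart warnings from that same node.
  - source_matchers:
      - alertname="NodeDraining"              # hypothetical alert raised during planned drains
    target_matchers:
      - alertname="PodRestartingFrequently"   # hypothetical alert to suppress
      - severity="warning"                    # never mute critical alerts this way
    equal: [node]                             # only inhibit when both alerts share this label value
```

Inhibition only mutes notifications. The target alerts still fire in Prometheus and remain visible in the Alertmanager UI.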
Interview Questions
Beginner
Prometheus decides when an alert condition is true. Alertmanager handles delivery, grouping, silencing, and routing.
What does the for clause in alerts do? It prevents alerts from firing on short-lived spikes and reduces noise.
Annotations are human-readable fields like summaries, descriptions, and runbook links.
A silence temporarily suppresses matching alerts in Alertmanager.
They provide routing context such as severity, environment, cluster, or owning team.
Intermediate
An alert is actionable when it reflects real impact and includes enough context to start response immediately.
Remove low-value alerts, add for clauses, group notifications, route correctly, and focus on symptoms that matter to users.
Error budget burn ties alerts to customer impact and service objectives, while raw CPU can be noisy and not always harmful.
Production alerts should page only for meaningful impact. Staging alerts are usually informational and routed to chat, not pagers.
Deduplication prevents multiple identical or equivalent alerts from spamming responders.
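
The error budget answer above can be made concrete with a burn-rate rule. This is a sketch under stated assumptions: a 99.9% availability SLO (an error budget of 0.001), a request counter named `http_requests_total` with a `code` label, and the commonly used fast-burn factor of 14.4, which corresponds to spending about 2% of a 30-day budget in one hour. None of these come from the original text.

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetFastBurn
        # Fires only when both the long (1h) and short (5m) windows exceed the burn threshold,
        # so the alert reacts to sustained burn and clears soon after the errors stop.
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
              /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning at more than 14.4x the sustainable rate"
```

Unlike a raw CPU threshold, this rule stays silent under heavy load as long as users are still being served successfully.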
Scenario-based
I would confirm whether rollouts naturally spike CPU, then add a for clause, change thresholds, or shift to a user-impact signal like latency or error rate.
Alertmanager grouping or deduplication is likely misconfigured, or alerts are labeled too granularly.
I would alert on node memory availability and eviction-related signals, not only per-pod memory spikes, because node pressure impacts multiple workloads.
I inspect whether Prometheus fired the alert, whether labels matched Alertmanager routes, receiver health, and whether a silence suppressed it.
No. Only urgent, user-impacting, actionable alerts should page. Everything else should go to lower-noise channels.
Summary
Good alerting is not about more alerts. It is about fewer, sharper alerts tied to real service health. Prometheus defines conditions. Alertmanager makes sure the right humans hear about the right problem at the right time.