Advanced, Lesson 7 of 11

Alerting

Build alerts that matter using Prometheus rules and Alertmanager routing, while avoiding alert fatigue.

Simple Explanation (ELI5)

Prometheus watches your metrics and decides when something crosses a dangerous line. Alertmanager then decides who to notify, how often, and through which channel. One system detects the problem. The other handles communication.

Real-world Analogy

A smoke detector senses smoke, but the building control system decides which alarms to sound, which doors to unlock, and which responders to notify. Prometheus is the detector. Alertmanager is the response coordinator.

Technical Explanation

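Prometheus evaluates each alerting rule's PromQL expression once per evaluation interval. When the condition first becomes true, the alert enters the pending state. If it stays true for the whole duration set in for, the alert moves to firing and Prometheus sends it to Alertmanager. Alertmanager groups related alerts, deduplicates them, supports silences, and routes notifications based on labels like severity, team, and environment.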

Element	Purpose	Example
alert rule	Condition definition	CPU above 90% for 10 minutes
for	How long the condition must hold before the alert fires	for: 10m filters out transient spikes
labels	Routing metadata	severity=critical, team=platform
annotations	Human-readable context	Summary and runbook URL
Alertmanager route	Notification policy	Critical → PagerDuty, warning → Slack
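
To make the wiring between the two systems concrete, here is a minimal sketch of the Prometheus side. The rule file path and the Alertmanager host name are assumptions for illustration, and the intervals are example values rather than recommendations.

```yaml
# prometheus.yml (sketch)
global:
  scrape_interval: 30s        # how often targets are scraped
  evaluation_interval: 30s    # how often alerting rules are evaluated

rule_files:
  - "rules/*.yml"             # hypothetical path holding the alert rule groups

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"   # hypothetical host; 9093 is Alertmanager's default port
```

With this in place, Prometheus loads every rule group under the assumed rules/ directory and forwards firing alerts to the listed Alertmanager.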

Visual Representation

PromQL Rule → Prometheus Fires Alert → Alertmanager Routes / Groups / Silences

Commands / Syntax

yaml

groups:
- name: cpu-alerts
  rules:
  - alert: HighCPUUsage
    # Busy fraction = 1 - idle fraction, averaged across all cores of each instance.
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.9
    for: 10m
    labels:
      severity: critical
      team: platform
    annotations:
      summary: "High CPU on {{ $labels.instance }}"
      description: "CPU usage has been above 90% for 10 minutes."

yaml

route:
  receiver: slack-default
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - team="platform"
      receiver: slack-platform
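
The route above names three receivers but does not define them or say how alerts are grouped. The following is a fuller sketch under stated assumptions: the webhook URLs, channel names, and routing key are placeholders, the cluster label is assumed to exist on your alerts, and the timing values are examples only.

```yaml
# alertmanager.yml (sketch)
route:
  receiver: slack-default
  group_by: ['alertname', 'cluster']   # alerts sharing these labels become one notification
  group_wait: 30s                      # wait before sending the first notification for a new group
  group_interval: 5m                   # wait before notifying about new alerts in an existing group
  repeat_interval: 4h                  # re-send interval while a group keeps firing
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty
    - matchers:
        - team="platform"
      receiver: slack-platform

receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder webhook
        channel: "#alerts"                                       # hypothetical channel
  - name: slack-platform
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder webhook
        channel: "#platform-alerts"                              # hypothetical channel
  - name: pagerduty
    pagerduty_configs:
      - routing_key: "REPLACE_ME"                                # placeholder integration key
```

Child routes are checked in order and the first match wins unless continue: true is set on it, so a critical alert owned by the platform team reaches only pagerduty in this sketch.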

Example (Real-world Use Case)

A production Kubernetes cluster pages the platform on-call only when API error rate or node memory pressure threatens user experience. Lower-severity alerts, like a warning threshold on CPU trend, go to Slack. Each alert includes a dashboard link and runbook URL.
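
A paging rule for the API error rate in this scenario might look like the sketch below. The metric http_requests_total and its status label are assumptions about how the application is instrumented, the 5% threshold is illustrative, and both URLs are placeholders.

```yaml
groups:
  - name: api-availability
    rules:
      - alert: HighAPIErrorRate
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "API 5xx error ratio above 5%"
          description: "Current error ratio: {{ $value | humanizePercentage }}"
          dashboard: "https://grafana.example.com/d/api-overview"            # placeholder link
          runbook_url: "https://runbooks.example.com/high-api-error-rate"    # placeholder link
```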

Hands-on Section

Create a high CPU alert with a for: 10m guard.
Add annotations containing summary, description, and a runbook URL.
Route critical alerts to one receiver and warnings to another.
Silence one alert temporarily and observe Alertmanager behavior.
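
One way to check the for: 10m guard without waiting on real traffic is a promtool unit test. The sketch below assumes the HighCPUUsage rule from the Commands / Syntax section is saved unchanged in a file named cpu-alerts.yml (a hypothetical name). The input series are synthetic.

```yaml
# cpu-alerts-test.yml, run with: promtool test rules cpu-alerts-test.yml
rule_files:
  - cpu-alerts.yml   # hypothetical file holding the HighCPUUsage rule

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic counters for one core: 57s busy and 3s idle per minute (95% busy).
      - series: 'node_cpu_seconds_total{instance="node1", cpu="0", mode="user"}'
        values: '0+57x30'
      - series: 'node_cpu_seconds_total{instance="node1", cpu="0", mode="idle"}'
        values: '0+3x30'
    alert_rule_test:
      # Early on, the 10m guard cannot have elapsed, so no firing alert is expected.
      - eval_time: 5m
        alertname: HighCPUUsage
      # Well past the guard, the alert should be firing with these labels and annotations.
      - eval_time: 20m
        alertname: HighCPUUsage
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              instance: node1
            exp_annotations:
              summary: "High CPU on node1"
              description: "CPU usage has been above 90% for 10 minutes."
```

If you add a runbook URL or other annotations to the rule, list them under exp_annotations as well, because the test compares the full annotation set.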

Try It Yourself

Write an alert for pod restarts over the last 15 minutes.
Design a warning and critical version of a memory alert.
List two ways to reduce alert noise without hiding real incidents.
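
If you want a starting point, here is one possible sketch rather than a reference answer. It assumes kube-state-metrics and node_exporter are being scraped, and every threshold and duration is an illustrative guess to tune for your environment.

```yaml
groups:
  - name: try-it-yourself
    rules:
      - alert: PodRestartingFrequently
        # More than 3 container restarts within the last 15 minutes.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted repeatedly in 15 minutes"

      - alert: NodeMemoryLow
        # Warning tier: less than 15% of memory available.
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Available memory below 15% on {{ $labels.instance }}"

      - alert: NodeMemoryCritical
        # Critical tier: less than 5% of memory available, with a shorter guard.
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Available memory below 5% on {{ $labels.instance }}"
```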

Debugging Scenarios

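If an alert never fires, confirm that the PromQL expression returns results, then check that the for duration is not longer than the condition typically lasts. A spike that clears before the for duration elapses resets the alert to inactive.
If alerts fire too often, check whether high-cardinality labels split one problem into many alerts, and review the grouping policy in Alertmanager.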
If Kubernetes alerts page every rollout, add suppression logic for draining nodes or expected restarts.
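
One way to express that suppression is an Alertmanager inhibit rule. This sketch assumes a separate alert named NodeDraining (hypothetical) fires while a node is cordoned or drained, and that both it and the alerts to suppress carry a node label.

```yaml
# alertmanager.yml fragment (sketch)
inhibit_rules:
  - source_matchers:
      - alertname="NodeDraining"   # hypothetical alert raised during a drain
    target_matchers:
      - severity="warning"         # alerts to mute while the source alert is firing
    equal: ['node']                # only mute alerts that refer to the same node
```

Limiting the target to warnings keeps critical alerts audible during a rollout. Widen or narrow the matchers to fit what your rollouts actually disturb.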

Interview Questions

Beginner

What is the difference between Prometheus alerting rules and Alertmanager?

Prometheus decides when an alert condition is true. Alertmanager handles delivery, grouping, silencing, and routing.

Why use the for clause in alerts?

It prevents alerts from firing on short-lived spikes and reduces noise.

What are annotations in an alert?

Annotations are human-readable fields like summaries, descriptions, and runbook links.

What is a silence?

A silence temporarily suppresses matching alerts in Alertmanager.

Why are labels useful in alerts?

They provide routing context such as severity, environment, cluster, or owning team.

Intermediate

What makes an alert actionable?

An alert is actionable when it reflects real impact and includes enough context to start response immediately.

How do you reduce alert fatigue?

Remove low-value alerts, add for clauses, group notifications, route correctly, and focus on symptoms that matter to users.

Why might you alert on error budget burn instead of raw CPU?

Error budget burn ties alerts to customer impact and service objectives, while raw CPU can be noisy and not always harmful.
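
As a rough illustration, a fast-burn alert could be sketched as below. It assumes a 99.9% availability SLO (an error budget of 0.001), a request counter named http_requests_total with a status label, and a burn-rate factor of 14.4, which corresponds to spending about 2% of a 30-day budget in one hour. All of these are assumptions to adapt.

```yaml
groups:
  - name: slo-burn
    rules:
      - alert: ErrorBudgetFastBurn
        # Both the long (1h) and short (5m) windows must exceed the burn threshold,
        # so the alert clears quickly once the error rate recovers.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning at more than 14.4x the sustainable rate"
```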

How should alerts differ between staging and production?

Production alerts should page only for meaningful impact. Staging alerts are usually informational and routed to chat, not pagers.

What is alert deduplication?

Deduplication collapses repeated notifications for the same alert, for example identical alerts sent by redundant Prometheus replicas, so responders are notified once instead of being spammed.

Scenario-based

CPU alerts fire every deployment. What would you change?

I would confirm whether rollouts naturally spike CPU, then add a for clause, change thresholds, or shift to a user-impact signal like latency or error rate.

A single outage generates 500 pages. Where is the problem?

Alertmanager grouping or deduplication is likely misconfigured, or alerts are labeled too granularly.

How would you alert on Kubernetes memory pressure at the node level?

I would alert on node memory availability and eviction-related signals, not only per-pod memory spikes, because node pressure impacts multiple workloads.
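
A sketch of the eviction-related half of that answer, assuming kube-state-metrics is installed and exposes kube_node_status_condition; the 5m guard is an illustrative choice.

```yaml
groups:
  - name: node-memory
    rules:
      - alert: NodeMemoryPressure
        # The kubelet sets the MemoryPressure condition when it is close to evicting pods.
        expr: kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} reports MemoryPressure"
```

Pairing this with a threshold on available node memory gives earlier warning, since the condition only flips once the node is already near eviction.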

A critical alert fires but nobody gets notified. What do you inspect?

I inspect whether Prometheus fired the alert, whether labels matched Alertmanager routes, receiver health, and whether a silence suppressed it.

Would you route every production alert to PagerDuty?

No. Only urgent, user-impacting, actionable alerts should page. Everything else should go to lower-noise channels.

Summary

Good alerting is not about more alerts. It is about fewer, sharper alerts tied to real service health. Prometheus defines conditions. Alertmanager makes sure the right humans hear about the right problem at the right time.
