Intermediate | Lesson 6 of 9

Alerts and Notifications

Configure Grafana alert rules, contact points, and notification policies for actionable incident response.

Simple Explanation (ELI5)

Alerts watch metrics and notify your team when values cross dangerous limits.

Technical Explanation

Grafana unified alerting evaluates alert rules on a schedule and routes the resulting notifications through notification policies to contact points. Good alerts use stable queries, meaningful thresholds, and suppression logic to avoid noise.

Visual Section

Metric Query -> Alert Rule Evaluation -> Contact Point (Slack/Email/Pager)

Hands-on Commands

promql
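# CPU alert query: matches any pod using more than 0.8 CPU cores, averaged over 5 minutes
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod) > 0.8

# Memory alert query: matches any pod whose working set exceeds 1.5e+09 bytes (about 1.5 GB)
sum(container_memory_working_set_bytes{container!=""}) by (pod) > 1.5e+09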

Debugging Scenarios

Real-world Use Case

A team uses Grafana alerts to detect CPU saturation and memory pressure for Kubernetes workloads and routes critical alerts to PagerDuty.

Interview Questions

Beginner

What is a Grafana alert rule?

A condition evaluated periodically to determine if an alert should fire.

What is a contact point?

A configured notification destination like Slack or email.

Why use a pending duration?

To avoid alerting on brief spikes: the condition must hold for the whole pending period before the alert fires.

What is a notification policy?

Routing logic defining which alerts go to which contact points.

Can Grafana alert on Prometheus data?

Yes, by querying a Prometheus data source in the alert rule.

Intermediate

How do you reduce alert noise?

Use better thresholds, grouping, and pending windows.

What is the difference between warning and critical alerts?

A warning is an early signal; a critical alert implies immediate impact.

How do you test alert reliability?

Generate synthetic load, then verify both that the rule triggers and that the notification reaches its destination.

Why include runbook links in alerts?

They speed up resolution by giving responders clear next steps.

How do you route by environment?

Label alerts with their environment and use notification policy matchers for prod/stage/dev.

Scenario-based

CPU alerts fire only during deploys. How do you fix this?

Add a pending duration and suppress the alert during deploy windows (for example, with a mute timing or a silence).

An alert fires but Slack gets nothing. What do you check?

Check the contact point secret, the webhook URL, and whether the notification policy's label matchers actually match the alert's labels.

How do you avoid paging on single-pod blips?

Aggregate by workload and require a sustained breach.
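
A minimal PromQL sketch of workload-level aggregation follows. It assumes that the pods of one workload share a namespace and container name, which is only a rough proxy for a workload; clusters that scrape kube-state-metrics may prefer to join on an owner label instead. The 0.8 threshold is reused from the Hands-on Commands section.

```promql
# Per-workload CPU, approximated by grouping on namespace and container
# (assumption: a workload's pods share the same container name).
# avg spreads the value over all pods in the group, so a spike in one
# pod is diluted instead of tripping the alert on its own.
avg by (namespace, container) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
) > 0.8
```

Pairing this query with a pending period on the alert rule covers the "sustained breach" half of the answer.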

Memory alerts are noisy for batch jobs. Why?

The thresholds are not workload-aware; use job-specific rules.
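
One alternative to maintaining a separate rule per job is to compare memory against each container's own limit rather than a fixed byte count. The sketch below assumes kube-state-metrics is scraped and exposes `kube_pod_container_resource_limits`; the 0.9 ratio is illustrative, not a recommendation.

```promql
# Working-set memory as a fraction of each container's own memory limit.
# Assumption: kube-state-metrics provides
# kube_pod_container_resource_limits with resource="memory".
sum by (namespace, pod, container) (
  container_memory_working_set_bytes{container!=""}
)
/
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{resource="memory"}
) > 0.9
```

Containers without a memory limit have no series on the right-hand side, so they drop out of the result and would need a separate rule.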

How do you design a request failure alert?

Alert on the error rate as a percentage of requests over a time window, grouped by a service label.
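
As a hedged illustration, an error-rate query might look like the following. The metric name `http_requests_total` and its `status` and `service` labels are assumptions about how the application is instrumented, and the 5% threshold is only an example.

```promql
# Percentage of requests that returned a 5xx status, per service, over 5 minutes.
# Assumptions: a counter named http_requests_total with `status` and `service`
# labels; the 5% threshold is illustrative.
100 *
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum by (service) (rate(http_requests_total[5m]))
> 5
```

Using a ratio instead of a raw error count keeps the threshold meaningful for both low-traffic and high-traffic services.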

Summary

Effective Grafana alerting combines clean queries, sane thresholds, and reliable routing so teams respond only to real issues.