Intermediate · Lesson 6 of 9

Alerts

Configure scheduled and real-time Splunk alerts with trigger conditions, throttling, and notification routing.

Simple Explanation (ELI5)

Alerts are Splunk watching your logs while you sleep. You tell Splunk: "run this search every 5 minutes and if you find more than 10 errors, send me an email (or page me)." Splunk keeps watch so you don't have to.

Technical Explanation

A Splunk alert is a saved search with a trigger condition and one or more alert actions. Splunk evaluates alerts on a schedule (every N minutes/hours) or in real time as events are indexed. Trigger conditions include: number of results (or of hosts or sources) compared against a threshold, or a custom condition written as a secondary search over the result fields. Alert actions include: email, webhook, run script, PagerDuty/Slack via Splunk Add-ons, and creating Splunk incidents.
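
To see these pieces on a running system, the saved searches can be listed together with their trigger, throttle, and action settings. This is a minimal sketch, assuming your role is allowed to run the rest command. It lists every scheduled saved search, so scheduled reports appear alongside alerts. The line starting with # is a label, not part of the search.

```spl
# List scheduled saved searches with their trigger, throttle, and action settings
| rest /servicesNS/-/-/saved/searches splunk_server=local
| search is_scheduled=1
| table title, cron_schedule, alert_type, alert_comparator, alert_threshold, alert.suppress, alert.suppress.period, actions
```

Here alert_type, alert_comparator, and alert_threshold describe the trigger condition, alert.suppress and alert.suppress.period describe throttling, and actions lists the enabled alert actions.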

Alert Lifecycle

Saved Search (runs on schedule) → Trigger condition evaluated → Throttle check (suppress duplicate) → Alert Actions (email, webhook, PagerDuty)

Alert Types

Scheduled Alert

Runs at fixed intervals (every 5 minutes, hourly). Best for threshold-based operational alerts that can tolerate a detection delay of a minute or more.

Real-time Alert

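Evaluates the trigger condition continuously as events are indexed. Reserve it for scenarios that cannot tolerate delay, such as critical security events, because a real-time search holds CPU on the search head for as long as it runs.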

Rolling Window Alert

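Scheduled alert whose search looks back over the same relative time window (e.g., earliest=-5m) on every run. Most common operational pattern. Note that Splunk's own documentation uses "rolling window" for a real-time alert trigger option, which is a separate feature from the scheduled pattern described here.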
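
A sketch of the scheduled look-back pattern with the window snapped to minute boundaries, so that consecutive runs neither overlap nor leave gaps. It assumes the alert is scheduled with the cron expression */5 * * * *. The index name and threshold are carried over from this lesson's examples. The line starting with # is a label, not part of the search.

```spl
# Scheduled every 5 minutes; window covers the last five whole minutes
index=prod_app level=ERROR earliest=-5m@m latest=@m
| stats count AS error_count
| where error_count > 50
```

If events typically arrive with some indexing delay, shifting the window back (for example earliest=-6m@m latest=-1m@m) is one possible adjustment. The one-minute offset is an assumption and depends on the ingestion lag of your environment.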

Alert Configuration

spl — Alert search examples
# Alert: more than 50 errors in the last 5 minutes
index=prod_app level=ERROR earliest=-5m
| stats count AS error_count
| where error_count > 50

# Alert: error rate exceeds 5% for any service
index=prod_app earliest=-5m
| stats count(eval(level="ERROR")) AS errors, count AS total by service
| eval error_rate=round(errors/total*100, 2)
| where error_rate > 5

# Alert: specific exception detected
index=prod_app ("NullPointerException" OR "OutOfMemoryError") earliest=-5m

# Security alert: brute force (5+ failed logins in 10 minutes per user)
index=auth action=login result=failure earliest=-10m
| stats count by user
| where count >= 5

# Alert: payment service response time p95 above 3 seconds
# stats returns a single summary row, so no eventstats/dedup pass is needed
index=prod_app service=payment-service earliest=-5m
| stats perc95(duration_ms) AS p95_latency
| where p95_latency > 3000

Webhook Alert Action

json — Webhook payload (Slack example)
{
  "text": "🚨 *Splunk Alert: High Error Rate*",
  "attachments": [
    {
      "color": "#ff0000",
      "fields": [
        {"title": "Alert Name", "value": "$name$", "short": true},
        {"title": "Trigger Count", "value": "$result.count$", "short": true},
        {"title": "Search", "value": "$results_link$", "short": false}
      ]
    }
  ]
}

Alert Throttling

✅ Throttling Best Practice
Set a throttle period on every alert to prevent alert storms. For example, with a 60-minute throttle, an alert that fires at 09:00 stays suppressed until 10:00 even if the trigger condition remains true. Configure in: Saved Searches → Edit → Throttle.

Setting | Description | Recommendation
Throttle window | Suppress re-trigger for N seconds | 300–900 seconds for operational alerts
Suppress fields | De-duplicate per field value (e.g., per host) | Set to the entity field (host, service)
Trigger once | Fire once per search run instead of once per matching result | Use for security alerts where one event is enough
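
To decide which alerts need a longer throttle window, it can help to count how often each one actually fired. This is a sketch against the audit trail, assuming your role can search index=_audit and that the alert_fired events carry the alert name in the ss_name field. Verify that field name on your deployment before relying on it. The line starting with # is a label, not part of the search.

```spl
# Alerts ranked by how often they fired in the last 24 hours
index=_audit action=alert_fired earliest=-24h
| stats count AS times_fired by ss_name
| sort - times_fired
```

An alert near the top of this list that fires many times during a single incident is a candidate for a throttle window or for field-based suppression.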

Real-world Use Case

An SRE team configured a 3-tier alerting strategy: (1) INFO-level scheduled alert every 15 minutes for trending anomalies, (2) WARNING alert every 5 minutes when error rate exceeded 2%, (3) CRITICAL real-time alert for OOM or DB connection pool exhaustion. Each tier routed to different channels — Slack for INFO, PagerDuty for WARNING and CRITICAL — with throttling preventing duplicate pages during sustained incidents.

Interview Questions

Beginner

What is a Splunk alert?

A saved search that runs on a schedule or in real time and executes configured actions (email, webhook, etc.) when a trigger condition is met.

What alert types does Splunk support?

Scheduled alerts (run on a cron-like interval) and real-time alerts (trigger as events are indexed).

What is alert throttling?

Suppressing repeated alert firings within a defined time window to prevent alert storms when a condition persists.

What alert actions does Splunk support?

Email, webhook, run script, Slack/PagerDuty via Add-ons, and creating Splunk incidents.

Where can you view fired alerts in Splunk?

Activity → Triggered Alerts, or search index=_audit action=alert_fired for the alert action audit trail.

Intermediate

What is the difference between trigger condition "number of results" and "custom condition"?

Number of results fires when the search returns more (or fewer) than a count threshold. Custom condition evaluates an expression against the search result fields for richer logic.
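
A sketch of how the two parts of a custom condition relate. The base search and the condition are entered separately in the alert definition. The field name error_count and the threshold are illustrative assumptions. The lines starting with # are labels, not part of either search.

```spl
# Base search of the alert: one row per service
index=prod_app level=ERROR earliest=-5m
| stats count AS error_count by service

# Custom trigger condition, entered in the alert's custom condition field
search error_count > 50
```

The alert triggers when the condition matches at least one row. As far as the Splunk documentation describes it, the alert actions still receive the full base search results, so the notification can show every service and not only the ones over the threshold.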

Why are real-time alerts expensive?

They evaluate continuously as events are indexed, consuming persistent CPU on the search head. Use them only for zero-latency critical events.

How do you alert on error rate rather than raw count?

Use stats to compute errors and total, then eval to calculate rate, then where to filter above threshold — SPL: | eval rate=errors/total*100 | where rate > 5.

How do you configure per-host alert suppression?

In the throttle settings, set the suppress fields to host — this allows one alert per unique host within the throttle window.

What is a notable event in Splunk?

A Splunk Enterprise Security concept — an alert that creates a structured incident record in the Incident Review dashboard for SOC analyst triage.

Scenario-based

You need to alert when any service has more than 100 errors in 5 minutes. Write the SPL.

index=prod_app level=ERROR earliest=-5m | stats count by service | where count > 100

Alert fires 50 times per incident. How do you fix it?

Set a throttle window of 3600 seconds (1 hour) with field suppression on the entity that's repeating. Switch trigger condition to fire once per run.

Alert needs to go to PagerDuty for on-call. How?

Install the Splunk Add-on for PagerDuty, configure the integration key, then select the PagerDuty alert action in the saved search alert actions settings.

You need to detect brute-force login attempts. Design the alert.

SPL: index=auth result=failure earliest=-5m | stats count by user | where count >= 10. Schedule it every 5 minutes and trigger when the number of results is greater than 0. For critical environments, consider a real-time alert instead.

How would you ensure an alert fires even when Splunk has low event volume?

Use a scheduled alert on a short interval (1–5 minutes) with condition "number of results is 0" — this fires when expected data is absent, detecting monitoring blind spots.
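
The "number of results is 0" condition catches a whole search going quiet. A per-host variant can be sketched with the metadata command, which reports when each host was last seen. The 15-minute threshold and the index name are assumptions for illustration. The line starting with # is a label, not part of the search.

```spl
# Hosts in prod_app that have not sent an event for more than 15 minutes
| metadata type=hosts index=prod_app
| eval minutes_silent=round((now() - recentTime) / 60)
| where minutes_silent > 15
| table host, minutes_silent
```

With this search the trigger condition becomes "number of results is greater than 0", and each result row names a host that has stopped reporting.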

Summary

Alerts close the loop between log data and operational response. Combining precise SPL triggers, appropriate throttling, and multi-channel routing transforms Splunk from a search tool into a proactive incident detection system.