Alerts
Configure scheduled and real-time Splunk alerts with trigger conditions, throttling, and notification routing.
Simple Explanation (ELI5)
Alerts are Splunk watching your logs while you sleep. You tell Splunk: "run this search every 5 minutes and if you find more than 10 errors, send me an email (or page me)." Splunk keeps watch so you don't have to.
Technical Explanation
A Splunk alert is a saved search with a trigger condition and one or more alert actions. Splunk evaluates alerts on a schedule (every N minutes/hours) or in real time as events are indexed. Trigger conditions include: number of results exceeds threshold, field value comparisons, or custom condition expressions. Alert actions include: email, webhook, run script, PagerDuty/Slack via Splunk Add-ons, and creating Splunk incidents.
Alert Lifecycle
Saved search runs on schedule → trigger condition evaluated → throttle check (suppress duplicate) → alert actions (email, webhook, PagerDuty)
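The lifecycle above can be sketched in a few lines of Python. This is only an illustrative model of the flow, not Splunk's implementation: the `Alert` class, `run_cycle` function, and the returned status strings are hypothetical names chosen for this example, and the search results are assumed to arrive as a list of dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Alert:
    """Hypothetical in-memory model of a saved search with alerting enabled."""
    name: str
    condition: Callable[[List[Dict]], bool]            # trigger condition over the results
    throttle_seconds: int                              # suppression window after firing
    actions: List[Callable[[str, List[Dict]], None]]   # e.g. email, webhook
    last_fired: Optional[float] = None


def run_cycle(alert: Alert, results: List[Dict], now: float) -> str:
    """Process one scheduled run: evaluate, check throttle, then fire actions."""
    # Step 1: evaluate the trigger condition against this run's results.
    if not alert.condition(results):
        return "not_triggered"
    # Step 2: throttle check, suppressing a repeat inside the window.
    if alert.last_fired is not None and now - alert.last_fired < alert.throttle_seconds:
        return "throttled"
    # Step 3: run every configured alert action and remember when we fired.
    for action in alert.actions:
        action(alert.name, results)
    alert.last_fired = now
    return "fired"
```

Each call to `run_cycle` stands for one scheduled execution of the search; the throttle state lives on the alert, so a condition that stays true only notifies again once the window has elapsed.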
Alert Types
- Scheduled: Runs at fixed intervals (every 5 minutes, hourly). Best for threshold-based operational alerts that can tolerate a detection delay of a minute or more.
- Real-time: Evaluates the trigger condition as each event is indexed. Use for zero-tolerance scenarios such as critical security events. High CPU cost.
- Rolling window: A scheduled alert whose search uses a relative time window (e.g., earliest=-5m), so every run looks back over the same interval. The most common operational pattern.
Alert Configuration
# Alert: more than 50 errors in the last 5 minutes
index=prod_app level=ERROR earliest=-5m
| stats count AS error_count
| where error_count > 50

# Alert: error rate exceeds 5% for any service
index=prod_app earliest=-5m
| stats count(eval(level="ERROR")) AS errors, count AS total by service
| eval error_rate=round(errors/total*100, 2)
| where error_rate > 5

# Alert: specific exception detected
index=prod_app ("NullPointerException" OR "OutOfMemoryError") earliest=-5m

# Security alert: brute force (5+ failed logins in 10 minutes per user)
index=auth action=login result=failure earliest=-10m
| stats count by user
| where count >= 5

# Alert: payment service response time p95 above 3 seconds
index=prod_app service=payment-service earliest=-5m
| stats perc95(duration_ms) AS p95_latency
| where p95_latency > 3000
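To make the error-rate search easier to reason about, here is a minimal Python sketch of the same stats / eval / where logic applied to events held in memory. The function name `services_over_error_rate` and the event dictionaries are assumptions made for illustration; in Splunk the computation runs inside the search itself.

```python
from collections import defaultdict
from typing import Dict, Iterable


def services_over_error_rate(events: Iterable[dict], threshold_pct: float = 5.0) -> Dict[str, float]:
    """Return {service: error_rate_pct} for services above the threshold."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        service = event.get("service")
        if service is None:
            continue  # like "stats ... by service", events without the field are dropped
        totals[service] += 1                  # count AS total
        if event.get("level") == "ERROR":
            errors[service] += 1              # count(eval(level="ERROR")) AS errors
    breaching = {}
    for service, total in totals.items():
        rate = round(errors[service] / total * 100, 2)  # eval error_rate=round(...)
        if rate > threshold_pct:                        # where error_rate > 5
            breaching[service] = rate
    return breaching
```

An alert built on this pattern would trigger whenever the returned mapping is non-empty, which corresponds to a "number of results greater than 0" trigger condition on the search.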
Webhook Alert Action
{
  "text": "🚨 *Splunk Alert: High Error Rate*",
  "attachments": [
    {
      "color": "#ff0000",
      "fields": [
        {"title": "Alert Name", "value": "$name$", "short": true},
        {"title": "Trigger Count", "value": "$result.count$", "short": true},
        {"title": "Search", "value": "$search_uri_absolute$", "short": false}
      ]
    }
  ]
}
Alert Throttling
| Setting | Description | Recommendation |
|---|---|---|
| Throttle window | Suppress re-trigger for N seconds | 300–900 seconds for operational alerts |
| Suppress fields | De-duplicate per field value (e.g., per host) | Set to the entity field (host, service) |
| Trigger once | Fire only for the first matching result | Use for security alerts where one event is enough |
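The interaction between the throttle window and the suppress fields can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not Splunk's actual suppression code: `FieldThrottle` and `should_fire` are hypothetical names, and timestamps are plain seconds.

```python
from typing import Dict, Iterable, Tuple


class FieldThrottle:
    """Suppress repeat firings per unique combination of suppress-field values."""

    def __init__(self, window_seconds: int, suppress_fields: Iterable[str]):
        self.window_seconds = window_seconds
        self.suppress_fields = tuple(suppress_fields)
        self._last_fired: Dict[Tuple, float] = {}

    def should_fire(self, result: dict, now: float) -> bool:
        # Build the suppression key from the configured fields, e.g. ("web-01",).
        key = tuple(result.get(name) for name in self.suppress_fields)
        last = self._last_fired.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # still inside the throttle window for this entity
        self._last_fired[key] = now
        return True
```

With `FieldThrottle(300, ["host"])`, a second result for the same host within 300 seconds is suppressed, while a result for a different host still fires. That is the reason for setting the suppress field to the entity (host, service) rather than throttling the whole alert.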
Debugging Scenarios
- Alert not firing when it should: Run the search manually in the last window and verify it returns results. Check trigger condition threshold.
- Alert firing too frequently (noise): Increase throttle window; raise the threshold; use | stats aggregation rather than per-event triggers.
- Webhook action not executing: Check alert action audit log (index=_audit action=alert_fired). Validate webhook URL and token in the action config.
- Email not received: Verify SMTP settings in Splunk (Settings → Server Settings → Email Settings) and check spam folder.
- Real-time alert causing high CPU: Switch to scheduled alert with a 1-minute interval — real-time alerts consume significant indexer resources.
Real-world Use Case
An SRE team configured a 3-tier alerting strategy: (1) INFO-level scheduled alert every 15 minutes for trending anomalies, (2) WARNING alert every 5 minutes when error rate exceeded 2%, (3) CRITICAL real-time alert for OOM or DB connection pool exhaustion. Each tier routed to different channels — Slack for INFO, PagerDuty for WARNING and CRITICAL — with throttling preventing duplicate pages during sustained incidents.
Interview Questions
Beginner
- What is a Splunk alert? A saved search that runs on a schedule or in real time and executes configured actions (email, webhook, etc.) when a trigger condition is met.
- What types of alerts does Splunk support? Scheduled alerts (run on a cron-like interval) and real-time alerts (trigger as events are indexed).
- What is alert throttling? Suppressing repeated alert firings within a defined time window to prevent alert storms when a condition persists.
- Which alert actions are available? Email, webhook, run script, Slack/PagerDuty via Add-ons, and creating Splunk incidents.
- Where can you see which alerts have fired? Activity → Triggered Alerts, or search index=_audit action=alert_fired for the alert action audit trail.
Intermediate
- How do the "number of results" and "custom" trigger conditions differ? Number of results fires when the search returns more (or fewer) than a count threshold. Custom condition evaluates an expression against the search result fields for richer logic.
- Why are real-time alerts expensive? They evaluate continuously as events are indexed, consuming persistent CPU on the search head. Use them only for critical events that cannot tolerate any detection delay.
- How do you alert on an error rate rather than an error count? Use stats to compute errors and total, then eval to calculate the rate, then where to filter above the threshold. SPL: | eval rate=errors/total*100 | where rate > 5.
- How do you throttle an alert per host? In the throttle settings, set the suppress fields to host. This allows one alert per unique host within the throttle window.
- What is a notable event? A Splunk Enterprise Security concept: an alert that creates a structured incident record in the Incident Review dashboard for SOC analyst triage.
Scenario-based
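- Write an alert search for any service logging more than 100 errors in 5 minutes. index=prod_app level=ERROR earliest=-5m | stats count by service | where count > 100
- An alert keeps re-firing during a sustained incident. How do you reduce the noise? Set a throttle window of 3600 seconds (1 hour) with field suppression on the entity that is repeating. Switch the trigger to fire once per search run rather than once per result.
- How do you route an alert to PagerDuty? Install the Splunk Add-on for PagerDuty, configure the integration key, then select the PagerDuty alert action in the saved search alert actions settings.
- How do you detect brute-force login attempts? SPL: index=auth result=failure earliest=-5m | stats count by user | where count >= 10. Schedule it every 5 minutes and trigger when the number of results is greater than 0; consider a real-time alert for critical environments.
- How do you alert when an expected data source stops sending events? Use a scheduled alert on a short interval (1–5 minutes) with the condition "number of results is 0". This fires when expected data is absent, detecting monitoring blind spots.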
Summary
Alerts close the loop between log data and operational response. Combining precise SPL triggers, appropriate throttling, and multi-channel routing transforms Splunk from a search tool into a proactive incident detection system.