Advanced · Lesson 7 of 9

Real-world Scenarios

Apply SPL to real production challenges — HTTP error analysis, security investigation, latency troubleshooting, and SLA reporting.

Simple Explanation (ELI5)

This is where Splunk earns its cost. Instead of guessing why something broke, you have a query for each real situation — search, find the cause, fix the problem, and move on.

Technical Overview

Real-world Splunk use cases fall into four broad categories: application health (error rates, latency, service status), security operations (user behavior, threat detection, compliance), infrastructure monitoring (resource usage, capacity, network), and business analytics (transaction volumes, SLA compliance, customer impact). Each requires specific field knowledge and query patterns.

Scenario 1: HTTP Error Rate Analysis

Goal: Identify which endpoints and services are returning 5xx errors, at what rate, and whether the rate is increasing.

spl — HTTP Error Analysis
# Overall 5xx error rate in the last hour
index=web_access status>=500 earliest=-1h
| timechart span=5m count AS errors

# 5xx breakdown by endpoint
index=web_access status>=500 earliest=-1h
| stats count AS errors by uri_path, status
| sort - errors | head 20

# Error rate percentage per service (services with more than 100 requests)
index=web_access earliest=-1h
| eval is_error=if(status>=500,1,0)
| stats sum(is_error) AS errors, count AS total by service
| where total > 100
| eval error_pct=round(errors/total*100, 2)
| sort - error_pct

# Detect error rate spike (compare each 5-minute bucket with the previous one)
index=web_access status>=500
| bucket span=5m _time
| stats count AS errors by _time
| streamstats current=f window=1 last(errors) AS previous_period
| eval spike=if(errors > previous_period*2, "SPIKE", "normal")
| table _time errors previous_period spike

Scenario 2: Failed Login and Security Investigation

Goal: Detect brute-force attempts, account lockouts, possible account takeovers, and logins from unexpected countries.

spl — Security Log Analysis
# Brute force detection: 10+ failed logins per user and source IP in 10 minutes
index=auth action=login result=failure earliest=-10m
| stats count by user, src_ip
| where count >= 10
| sort - count

# Account lockout events
index=wineventlog EventCode=4740 earliest=-1h
| table _time TargetUserName SubjectUserName IpAddress

# Failed and successful logins by the same user within 30 minutes (possible account takeover)
index=auth action=login earliest=-30m
| transaction user maxspan=30m
| where match(_raw, "result=failure") AND match(_raw, "result=success")
| table _time user src_ip

# Login from outside the expected country (using iplocation; United Kingdom is the example home country)
index=auth action=login result=success earliest=-24h
| iplocation src_ip
| where Country != "United Kingdom"
| stats count by user, Country, src_ip
| sort - count

# Top 10 failed login users
index=auth action=login result=failure earliest=-1h
| stats count AS failed_attempts by user
| sort - failed_attempts | head 10

Scenario 3: Application Latency Troubleshooting

Goal: Identify slow endpoints, find latency outliers, and correlate latency with errors.

spl — Latency Analysis
# p50/p95/p99 latency per service
index=prod_app earliest=-1h
| stats perc50(duration_ms) AS p50, perc95(duration_ms) AS p95, perc99(duration_ms) AS p99 by service
| sort - p95

# Latency trend over time
index=prod_app service=payment-service earliest=-2h
| timechart span=5m avg(duration_ms) AS avg_latency, perc95(duration_ms) AS p95_latency

# Slow transactions - top 20 slowest individual records
index=prod_app earliest=-1h
| sort - duration_ms
| head 20
| table _time service endpoint user_id duration_ms message

# Correlation: chart average latency and error count together to see whether they rise in step
index=prod_app service=checkout-service earliest=-1h
| timechart span=5m avg(duration_ms) AS avg_latency, count(eval(level="ERROR")) AS errors
| rename avg_latency AS "Avg Latency (ms)", errors AS "Error Count"

# Services above SLA threshold (e.g., p95 > 500ms violates SLA)
index=prod_app earliest=-1h
| stats perc95(duration_ms) AS p95 by service
| where p95 > 500
| eval status="SLA_BREACH"
| table service p95 status

Scenario 4: SLA and Business Reporting

Goal: Measure availability, transaction success rates, and compliance against SLA thresholds.

spl — SLA Reporting
# Payment transaction success rate per day
index=prod_app sourcetype=payment_logs earliest=-7d
| eval outcome=if(status="success","success","failure")
| timechart span=1d count(eval(outcome="success")) AS successful,
            count(eval(outcome="failure")) AS failed
| eval success_rate=round(successful/(successful+failed)*100,3)

# Uptime calculation (share of 5-minute windows with no ERROR events; windows with no events at all are not counted)
index=prod_app service=api-gateway earliest=-30d
| bucket _time span=5m
| stats count(eval(level="ERROR")) AS errors by _time
| eval status=if(errors=0, "up", "down")
| stats count(eval(status="up")) AS up_windows, count AS total_windows
| eval availability_pct=round(up_windows/total_windows*100,4)

# Top customer-impacting errors
index=prod_app level=ERROR earliest=-24h
| stats dc(user_id) AS affected_users, count AS occurrences by message, service
| sort - affected_users
| head 10

Scenario 5: Anomaly Detection with SPL

spl — Anomaly Detection
# Detect count anomaly using standard deviation
index=prod_app level=ERROR earliest=-24h
| timechart span=1h count AS hourly_errors
| eventstats avg(hourly_errors) AS mean_errors, stdev(hourly_errors) AS stddev_errors
| eval upper_bound=mean_errors+(stddev_errors*2)
| eval anomaly=if(hourly_errors > upper_bound, "ANOMALY", "normal")
| where anomaly="ANOMALY"

# Detect new hosts suddenly generating errors
index=prod_app level=ERROR earliest=-1h
| stats count by host
| join type=left host
    [search index=prod_app level=ERROR earliest=-2h latest=-1h | stats count AS prev_count by host]
| where isnull(prev_count)
| rename host AS new_error_host

Real-world Use Case

During a Black Friday traffic surge, a retail platform's checkout service degraded. The SRE on-call ran three queries in 90 seconds: (1) error rate per service — identified checkout-service at 12% error rate; (2) p95 latency — found payment gateway calls exceeding 8 seconds; (3) trace ID correlation — linked all slow calls to one database replica host. Root cause identified: replica lag. Traffic rerouted in 4 minutes, resolving the incident before brand damage occurred.

Interview Questions

Beginner

What is a log analysis scenario?

A specific investigation using SPL queries to answer a business or operational question from log data — e.g., why is the error rate high, who logged in from an unusual location.

What is p95 latency?

The 95th percentile of response times: 95% of requests complete at or below this value. It is a useful SLO indicator because it exposes tail latency that an average hides, while staying less sensitive to a handful of extreme outliers than the maximum.

What SPL command calculates percentiles?

stats perc95(fieldname) — compute the 95th percentile of a numeric field across events.

What is a brute-force attack in log context?

Many failed login attempts from the same IP or against the same user account in a short time window — detectable via stats count by user | where count > threshold.

What does dc() mean in SPL?

Distinct count — the number of unique values for the specified field across matching events.

Intermediate

How do you detect a sudden spike in errors using SPL?

Use timechart (or bucket plus stats) to count errors per time bucket, then streamstats current=f window=1 to carry the previous bucket's count onto each row. Flag a spike when the current count is more than 2× the previous one.

How do you find which customers were affected by an outage?

Search for error events in the outage time window, then use stats dc(user_id) and values(user_id) to extract distinct affected customer IDs.
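
A minimal sketch of that search, reusing the prod_app index and the level, user_id, and service fields from the scenarios above. The outage window (four to two hours ago) is a placeholder; substitute the real incident start and end times.

```spl
# Distinct affected users per service during the outage window
index=prod_app level=ERROR earliest=-4h latest=-2h
| stats dc(user_id) AS affected_users, values(user_id) AS user_ids by service
| sort - affected_users
```

values(user_id) returns a multivalue list, which can grow large during a wide outage; drop it when only the count is needed.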

What is streamstats?

A running statistics command: it calculates aggregate functions cumulatively over events in the order they appear in the search results. Adding window=N restricts the calculation to a sliding window of the last N events.

How do you calculate service availability from logs?

Bucket time into intervals, evaluate whether each window has errors or not, then compute up_windows/total_windows × 100 for availability percentage.

What is impossible travel detection?

Detecting when the same user account logs in from two geographically distant locations in a time window too short for physical travel — indicates credential compromise.
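
One possible way to sketch this in SPL, assuming the auth index and fields from Scenario 2 and that iplocation can resolve src_ip to lat and lon. The distance uses the haversine formula with an Earth radius of 6371 km, and the 900 km/h cutoff (roughly airliner speed) is an arbitrary assumption to tune. VPNs and GeoIP inaccuracies will produce false positives.

```spl
# Consecutive successful logins per user that imply an implausible travel speed
index=auth action=login result=success earliest=-24h
| iplocation src_ip
| where isnotnull(lat) AND isnotnull(lon)
| sort 0 user _time
| streamstats current=f window=1 global=f last(lat) AS prev_lat, last(lon) AS prev_lon, last(_time) AS prev_time by user
| where isnotnull(prev_time)
| eval dlat=(lat-prev_lat)*pi()/180
| eval dlon=(lon-prev_lon)*pi()/180
| eval a=pow(sin(dlat/2),2) + cos(prev_lat*pi()/180)*cos(lat*pi()/180)*pow(sin(dlon/2),2)
| eval distance_km=round(2*6371*asin(sqrt(a)), 0)
| eval hours=(_time-prev_time)/3600
| eval speed_kmh=if(hours>0, round(distance_km/hours, 0), null())
| where speed_kmh > 900
| table _time user src_ip distance_km speed_kmh
```

The streamstats step carries each user's previous login location and time onto the next login, so every row compares one pair of consecutive logins.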

Scenario-based

An incident just started. You have 2 minutes to find the source. What are your first 3 SPL queries?

1. Error count by service for last 10 min. 2. Timechart errors by service to identify when degradation started. 3. Most common error messages for the highest-error service.
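
Those three steps might look like the following, assuming the prod_app index and the level, service, and message fields used in the scenarios above. checkout-service in the third query is a placeholder for whichever service the first query surfaces.

```spl
# 1. Error count by service, last 10 minutes
index=prod_app level=ERROR earliest=-10m
| stats count AS errors by service
| sort - errors

# 2. Errors per minute by service, to see when degradation started
index=prod_app level=ERROR earliest=-1h
| timechart span=1m count by service

# 3. Most common error messages for the highest-error service
index=prod_app level=ERROR service=checkout-service earliest=-10m
| top limit=10 message
```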

Security team asks: "Did anyone access the admin panel outside business hours last week?" Write the SPL.

index=web_access uri_path="/admin*" earliest=-7d | eval hour=tonumber(strftime(_time,"%H")) | where hour<8 OR hour>18 | table _time user src_ip uri_path
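
The query above checks only the hour of day, so daytime access on a weekend would not be reported. A hedged variant that also treats Saturday and Sunday as outside business hours is sketched below. It assumes the same hour boundaries as above, snaps the range to the last seven full days, and relies on strftime's %w returning the weekday as a number with 0 for Sunday. Hours are rendered in the time zone Splunk applies to the search.

```spl
# Admin panel access outside weekday business hours
index=web_access uri_path="/admin*" earliest=-7d@d latest=@d
| eval hour=tonumber(strftime(_time,"%H"))
| eval dow=tonumber(strftime(_time,"%w"))
| where hour<8 OR hour>18 OR dow=0 OR dow=6
| table _time user src_ip uri_path
```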

Services seem fine but users report slowness. How do you investigate with Splunk?

Query p95/p99 latency by service and endpoint. Calculate error-free slow responses (high duration but status 200). Correlate with external dependency calls.
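
A sketch of the first two steps, assuming the prod_app index and the duration_ms, service, endpoint, and level fields from Scenario 3. The 2000 ms cutoff for "slow" is an illustrative assumption, not a recommended value.

```spl
# Tail latency by service and endpoint
index=prod_app earliest=-1h
| stats perc95(duration_ms) AS p95, perc99(duration_ms) AS p99, count AS requests by service, endpoint
| sort - p95

# Slow responses that did not log an error
index=prod_app level!=ERROR duration_ms>2000 earliest=-1h
| stats count AS slow_requests, avg(duration_ms) AS avg_ms by service, endpoint
| sort - slow_requests
```

Endpoints that rank high in the second query are slow without failing, which error-rate dashboards alone would not show.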

Management needs a weekly report of top 5 customer-impacting bugs. How?

SPL: errors with dc(user_id) grouped by error message, sorted by affected user count. Schedule as a weekly report with email action delivering PDF/CSV.

How do you baseline normal behavior for anomaly detection?

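Run a scheduled saved search over 7–30 days of historical data to compute the mean and standard deviation per metric, and store the result in a lookup or summary index. Detection searches then compare current values against that stored baseline, for example flagging values more than two standard deviations above the mean.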
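
One way to persist and reuse such a baseline is a CSV lookup written by outputlookup and read back with lookup. The sketch below assumes the prod_app index from the scenarios above; error_baseline.csv is a hypothetical lookup file name. Hours with zero errors produce no row in the first search, so the mean will be biased upward for quiet services.

```spl
# Scheduled search: hourly error baseline per service from 30 days of data
index=prod_app level=ERROR earliest=-30d@h latest=@h
| bin _time span=1h
| stats count AS hourly_errors by _time, service
| stats avg(hourly_errors) AS mean_errors, stdev(hourly_errors) AS stddev_errors by service
| outputlookup error_baseline.csv

# Detection search: compare the last full hour against the stored baseline
index=prod_app level=ERROR earliest=-1h@h latest=@h
| stats count AS hourly_errors by service
| lookup error_baseline.csv service OUTPUT mean_errors stddev_errors
| eval upper_bound=mean_errors+(stddev_errors*2)
| where hourly_errors > upper_bound
```

The two-standard-deviation bound mirrors the threshold used in Scenario 5; widen it if the detection search is too noisy.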

Summary

Real-world Splunk value comes from having ready-made query patterns for each incident type. HTTP error analysis, security investigation, latency troubleshooting, and SLA reporting are the four workhorses, with anomaly detection layered on top; master each and you can answer most operational questions directly from your logs.