IntermediateLesson 4 of 9

Searching & Querying (SPL)

Master the Search Processing Language — Splunk's pipeline-based query language for searching, transforming, and reporting on indexed data.

Simple Explanation (ELI5)

SPL is like asking questions to your logs: "Show me all the errors from the past hour", "How many logins failed per minute?", "Which server sent the most traffic?" SPL chains these questions together using a pipe symbol — each step refines the result further.

Technical Explanation

SPL (Search Processing Language) follows a pipeline model: raw events flow through a sequence of commands separated by the pipe character (|). The first command defines the event set (index, time range, keywords); subsequent commands transform, filter, and aggregate that set into results. SPL commands fall into three categories: streaming (operate per event), transforming (aggregate multiple events), and generating (produce new data).

SPL Pipeline Structure

Search
index=* level=ERROR
|
Filter/Transform
where duration > 1000
|
Aggregate
stats count by host
|
Output
table / chart

Core SPL Commands

spl — basics
# Basic keyword search with time range
index=prod_app level=ERROR earliest=-1h latest=now

# Filter by field value
index=prod_app status_code=500

# Wildcard search
index=prod_app "payment*timeout*"

# Boolean operators
index=prod_app (level=ERROR OR level=FATAL) host=app-server-01

# Exclude events
index=prod_app level=ERROR NOT sourcetype=debug_noise
spl — stats and aggregation
# Count events by field
index=prod_app level=ERROR | stats count by host

# Count by multiple fields
index=prod_app | stats count by level, sourcetype | sort -count

# Sum a numeric field
index=prod_app level=ERROR | stats sum(duration_ms) as total_ms, avg(duration_ms) as avg_ms by service

# timechart — errors over time
index=prod_app level=ERROR | timechart span=5m count by host

# dc (distinct count)
index=prod_app | stats dc(user_id) as unique_users by service
spl — field extraction and transformation
# rex — extract fields with regex
index=prod_app sourcetype=app_logs
| rex field=_raw "duration=(?P<duration_ms>\d+)"
| stats avg(duration_ms) as avg_latency by service

# eval — create calculated fields
index=prod_app
| eval latency_bucket = case(
    duration_ms < 100, "fast",
    duration_ms < 500, "normal",
    duration_ms < 2000, "slow",
    true(), "critical"
  )
| stats count by latency_bucket

# where — conditional filter
index=prod_app | where duration_ms > 2000

# rename and table
index=prod_app level=ERROR
| stats count as error_count by host, service
| rename host as Server, service as Service
| table Server, Service, error_count
| sort -error_count
spl — advanced commands
# dedup — remove duplicate events based on field
index=prod_app | dedup user_id sortby -_time

# top — most frequent values
index=prod_app level=ERROR | top limit=10 host

# rare — least frequent values
index=prod_app | rare limit=5 status_code

# lookup — enrich with external data
index=prod_app
| lookup server_inventory.csv host OUTPUT team, environment
| stats count by team, environment

# transaction — group related events
index=prod_app
| transaction user_id maxspan=30m maxpause=5m
| where eventcount > 5 AND duration > 60

Search Optimization

Performance Tips
  • Always specify index= — avoids scanning all indexes.
  • Add sourcetype= to narrow down event sets early in the pipeline.
  • Use the earliest/latest time picker — never search All Time in production.
  • Filter before transforming: level=ERROR | stats ... not stats ... | where level=ERROR.
  • Use tstats for extremely high-volume metadata queries (tsidx-based, much faster).

Debugging Scenarios

Real-world Use Case

A DevOps team needed to identify which microservice was generating the most 500 errors during a degradation incident. One SPL query delivered the answer in 8 seconds across 200 GB of logs: index=prod_app status_code=500 earliest=-1h | stats count by service | sort -count | head 10. The top offending service was identified, rolled back, and the incident resolved within 12 minutes of detection.

Interview Questions

Beginner

What is the SPL pipeline?

A sequence of SPL commands chained with the pipe character (|) — each command processes the output of the previous one.

What does the stats command do?

Aggregates events — counts, sums, averages, min/max — optionally grouped by field values.

Difference between stats and timechart?

stats produces a flat table; timechart automatically groups by time interval (span) and produces data suitable for line/bar charts.

What is rex used for?

On-the-fly field extraction using named capture groups in a regex — useful when automatic field extraction hasn't been configured.

How do you restrict a search to the last hour?

Use earliest=-1h latest=now or select Last 60 Minutes in the time range picker.

Intermediate

What is the difference between where and search filtering?

Inline field=value filtering happens at the index-scan phase (fast). The where command evaluates expressions on already-retrieved events (slower, more flexible for computed fields).

When would you use tstats instead of stats?

When querying billions of events for simple aggregations over metadata fields (host, sourcetype, time) — tstats reads tsidx index files, not raw event data, so it is 10–100× faster.

What is the eval command?

Creates or modifies a field using expressions — math, string functions, conditional (case/if) logic. Runs per-event as a streaming command.

What is the transaction command?

Groups events from the same session/user into a single event based on shared field values, with configurable time boundaries. Useful for session analysis.

How do you enrich logs with external data in SPL?

Use the lookup command with a CSV lookup table or KV store definition to add fields from external data based on a matching field value.

Scenario-based

Write a query to find the top 5 slowest API endpoints in the last hour.

index=prod_app earliest=-1h | stats avg(duration_ms) as avg_latency by endpoint | sort -avg_latency | head 5

How would you detect accounts with 5+ failed logins in 10 minutes?

index=auth action=failed | bucket span=10m _time | stats count by _time, username | where count >= 5

A field called duration exists in raw text but isn't extracted. How do you fix it?

Short-term: add | rex field=_raw "duration=(?P<duration>\d+)". Long-term: add a TRANSFORMS-based extraction to props.conf and transforms.conf for the sourcetype.

Your dashboard panel is slow, taking 30+ seconds. How do you optimize it?

Add index/sourcetype filters, narrow time range, move to summary indexing (create a scheduled search that pre-aggregates), or rewrite with tstats if metadata fields are sufficient.

How would you compare error rates across two environments (prod and staging)?

index=* level=ERROR | stats count by index | where index=prod_app OR index=staging_app | rename index as Environment or use an eval to label them and compare.

Summary

SPL mastery is the core skill for Splunk practitioners. Start with keyword searches, add field filters for precision, use stats/timechart for aggregation, and rex/eval for transformation. Optimization — narrowing index, sourcetype, and time range — is the difference between a 2-second and 30-second search.