Searching & Querying (SPL)
Master the Search Processing Language — Splunk's pipeline-based query language for searching, transforming, and reporting on indexed data.
Simple Explanation (ELI5)
SPL is like asking questions to your logs: "Show me all the errors from the past hour", "How many logins failed per minute?", "Which server sent the most traffic?" SPL chains these questions together using a pipe symbol — each step refines the result further.
Technical Explanation
SPL (Search Processing Language) follows a pipeline model: raw events flow through a sequence of commands separated by the pipe character (|). The first command defines the event set (index, time range, keywords); subsequent commands transform, filter, and aggregate that set into results. SPL commands fall into three categories: streaming (operate per event), transforming (aggregate multiple events), and generating (produce new data).
SPL Pipeline Structure
index=* level=ERROR
where duration > 1000
stats count by host
table / chart
Core SPL Commands
# Basic keyword search with time range index=prod_app level=ERROR earliest=-1h latest=now # Filter by field value index=prod_app status_code=500 # Wildcard search index=prod_app "payment*timeout*" # Boolean operators index=prod_app (level=ERROR OR level=FATAL) host=app-server-01 # Exclude events index=prod_app level=ERROR NOT sourcetype=debug_noise
# Count events by field index=prod_app level=ERROR | stats count by host # Count by multiple fields index=prod_app | stats count by level, sourcetype | sort -count # Sum a numeric field index=prod_app level=ERROR | stats sum(duration_ms) as total_ms, avg(duration_ms) as avg_ms by service # timechart — errors over time index=prod_app level=ERROR | timechart span=5m count by host # dc (distinct count) index=prod_app | stats dc(user_id) as unique_users by service
# rex — extract fields with regex
index=prod_app sourcetype=app_logs
| rex field=_raw "duration=(?P<duration_ms>\d+)"
| stats avg(duration_ms) as avg_latency by service
# eval — create calculated fields
index=prod_app
| eval latency_bucket = case(
duration_ms < 100, "fast",
duration_ms < 500, "normal",
duration_ms < 2000, "slow",
true(), "critical"
)
| stats count by latency_bucket
# where — conditional filter
index=prod_app | where duration_ms > 2000
# rename and table
index=prod_app level=ERROR
| stats count as error_count by host, service
| rename host as Server, service as Service
| table Server, Service, error_count
| sort -error_count# dedup — remove duplicate events based on field index=prod_app | dedup user_id sortby -_time # top — most frequent values index=prod_app level=ERROR | top limit=10 host # rare — least frequent values index=prod_app | rare limit=5 status_code # lookup — enrich with external data index=prod_app | lookup server_inventory.csv host OUTPUT team, environment | stats count by team, environment # transaction — group related events index=prod_app | transaction user_id maxspan=30m maxpause=5m | where eventcount > 5 AND duration > 60
Search Optimization
- Always specify
index=— avoids scanning all indexes. - Add
sourcetype=to narrow down event sets early in the pipeline. - Use the earliest/latest time picker — never search All Time in production.
- Filter before transforming:
level=ERROR | stats ...notstats ... | where level=ERROR. - Use
tstatsfor extremely high-volume metadata queries (tsidx-based, much faster).
Debugging Scenarios
- Search returns zero results: Check index name, time range, and that the keyword is actually in the raw data (use
| head 10 | table _raw). - Field not extracted automatically: Verify sourcetype is correct; use
| rexas a temporary extraction or fix props/transforms for the sourcetype. - Search is slow: Add index/sourcetype filters, narrow time range, and consider summary indexing or tstats for aggregate dashboards.
- timechart shows gaps: Data was not indexed during that window — check forwarder health for that period.
Real-world Use Case
A DevOps team needed to identify which microservice was generating the most 500 errors during a degradation incident. One SPL query delivered the answer in 8 seconds across 200 GB of logs: index=prod_app status_code=500 earliest=-1h | stats count by service | sort -count | head 10. The top offending service was identified, rolled back, and the incident resolved within 12 minutes of detection.
Interview Questions
Beginner
A sequence of SPL commands chained with the pipe character (|) — each command processes the output of the previous one.
Aggregates events — counts, sums, averages, min/max — optionally grouped by field values.
stats produces a flat table; timechart automatically groups by time interval (span) and produces data suitable for line/bar charts.
On-the-fly field extraction using named capture groups in a regex — useful when automatic field extraction hasn't been configured.
Use earliest=-1h latest=now or select Last 60 Minutes in the time range picker.
Intermediate
Inline field=value filtering happens at the index-scan phase (fast). The where command evaluates expressions on already-retrieved events (slower, more flexible for computed fields).
When querying billions of events for simple aggregations over metadata fields (host, sourcetype, time) — tstats reads tsidx index files, not raw event data, so it is 10–100× faster.
Creates or modifies a field using expressions — math, string functions, conditional (case/if) logic. Runs per-event as a streaming command.
Groups events from the same session/user into a single event based on shared field values, with configurable time boundaries. Useful for session analysis.
Use the lookup command with a CSV lookup table or KV store definition to add fields from external data based on a matching field value.
Scenario-based
index=prod_app earliest=-1h | stats avg(duration_ms) as avg_latency by endpoint | sort -avg_latency | head 5
index=auth action=failed | bucket span=10m _time | stats count by _time, username | where count >= 5
Short-term: add | rex field=_raw "duration=(?P<duration>\d+)". Long-term: add a TRANSFORMS-based extraction to props.conf and transforms.conf for the sourcetype.
Add index/sourcetype filters, narrow time range, move to summary indexing (create a scheduled search that pre-aggregates), or rewrite with tstats if metadata fields are sufficient.
index=* level=ERROR | stats count by index | where index=prod_app OR index=staging_app | rename index as Environment or use an eval to label them and compare.
Summary
SPL mastery is the core skill for Splunk practitioners. Start with keyword searches, add field filters for precision, use stats/timechart for aggregation, and rex/eval for transformation. Optimization — narrowing index, sourcetype, and time range — is the difference between a 2-second and 30-second search.