Interview Preparation
Targeted preparation for Splunk interviews in DevOps, SRE, SOC Analyst, and Platform Engineering roles.
Simple Explanation (ELI5)
Splunk interviews test whether you understand logs, can write SPL to find answers, and know how to set up and maintain Splunk in production. The goal is to demonstrate practical thinking, not just memorize features.
What Interviewers Test
- Concepts: Architecture (forwarder/indexer/search head), ingestion pipeline, index management.
- SPL: Write queries from scratch for common scenarios — error analysis, latency investigation, security events.
- Operational: Troubleshoot no-data scenarios, manage license, configure alerts.
- Production patterns: Retention, HA clustering, performance tuning, log lifecycle.
Core Revision Topics
UF, HF, Indexer, Search Head, Deployment Server. When to use each component.
inputs.conf, HEC, syslog. Sourcetype, index routing, pipeline stages.
stats, timechart, eval, rex, where, top, dedup, lookup, transaction, streamstats.
Simple XML, tokens, drilldowns, saved searches, scheduled reports.
Scheduled vs real-time, trigger conditions, throttling, alert actions.
_internal index, btool, forwarder diagnostics, license_usage, Job Inspector.
Rapid-fire Questions
Fundamentals (Beginner)
Structured logs (JSON/KV) have named, typed fields that tools extract automatically — no regex parsing required, making searching and dashboarding faster and more reliable.
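The ingestion pipeline runs in sequence: input → parsing (line breaking, timestamp extraction, sourcetype assignment) → index-time transforms (routing, masking, and any index-time field extraction; most field extraction happens later, at search time) → indexing into buckets.
A bucket is a time-bounded directory on the indexer containing compressed raw event data and index structures. It moves through stages: hot → warm → cold → frozen (archived or deleted).
A sourcetype is a label that identifies the data format. Splunk uses it to apply the correct parsing rules, timestamp extraction, and field extractions.
The conventional forwarding port is TCP 9997 (Splunk-to-Splunk forwarding). It is not hard-coded: the indexer's receiving port is set in inputs.conf, and the forwarder's target host and port are set in outputs.conf.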
Index is the storage repository (like a database). Sourcetype is the format label that controls how data is parsed, not where it's stored.
HTTP Event Collector — allows applications to POST events directly to Splunk via HTTPS with a token, without needing a forwarder agent.
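As a minimal sketch of what "POST events with a token" looks like from an application, the snippet below builds (but does not send) an HEC request using only the Python standard library. The host name, token value, and event fields are hypothetical placeholders; the `/services/collector/event` path and the `Authorization: Splunk <token>` header follow the documented HEC conventions, and 8088 is the default HEC port.

```python
import json
import urllib.request


def build_hec_request(base_url: str, token: str, event: dict,
                      sourcetype: str = "_json",
                      index: str = "main") -> urllib.request.Request:
    """Build (but do not send) a POST request for the HEC event endpoint."""
    payload = {"event": event, "sourcetype": sourcetype, "index": index}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",  # HEC token, not a user password
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical host and token, for illustration only.
req = build_hec_request(
    "https://splunk.example.com:8088",
    "00000000-0000-0000-0000-000000000000",
    {"level": "ERROR", "service": "payment", "message": "timeout"},
    index="prod_app",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```

In an interview, the points worth calling out are that the token identifies the input (and can pin the index and sourcetype), and that no forwarder agent is involved on the sending host.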
The Deployment Server centrally manages and distributes apps and configuration updates to Universal Forwarder fleets.
SPL (Intermediate)
index=prod_app level=ERROR earliest=-1h | stats count by service | sort - count
index=prod_app earliest=-1h | stats count(eval(level="ERROR")) AS errors, count AS total by service | eval error_pct=round(errors/total*100,2)
stats aggregates and collapses events into summary rows. eventstats computes statistics but adds results back to each original event — preserving all events for downstream commands.
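The collapse-versus-annotate distinction can be illustrated outside Splunk. The following is a plain-Python analogy, not SPL: the sample events and field names are made up, and the two list comprehensions only mimic what `stats count by service` and `eventstats count by service` would return.

```python
from collections import Counter

# Made-up sample events for illustration.
events = [
    {"service": "payment", "level": "ERROR"},
    {"service": "payment", "level": "INFO"},
    {"service": "search", "level": "ERROR"},
]

counts = Counter(e["service"] for e in events)

# stats-like: one summary row per service; the original events are gone.
stats_rows = [{"service": s, "count": c} for s, c in sorted(counts.items())]

# eventstats-like: every original event is kept and gains the aggregate as a field.
eventstats_rows = [{**e, "count": counts[e["service"]]} for e in events]

print(stats_rows)            # 2 rows, one per service
print(len(eventstats_rows))  # still 3 events, each with a "count" field
```

This is why eventstats suits follow-up filters such as "events slower than twice their service average": the per-event fields are still available after the aggregate is attached.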
transaction groups related events into one (sessions, request-response correlations with variable boundaries). stats is faster for fixed aggregations. Prefer stats for performance.
The lookup command enriches events by joining additional fields from a CSV lookup table or KV store based on a matching field value, for example adding a team owner based on hostname.
earliest is a search time modifier that limits which indexed events are scanned. _time is the event's extracted timestamp field, searchable after retrieval.
index=prod_app | stats latest(_time) AS last_seen, latest(_raw) AS last_event by host | sort - last_seen
Architecture & Operations (Intermediate)
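Use a Heavy Forwarder (HF) instead of a Universal Forwarder (UF) when you need to parse, filter, mask PII, or conditionally route events before indexing. A UF sends unparsed data; an HF parses and processes it first.
An indexer cluster is a group of indexers managed by a Cluster Manager for high availability. RF (Replication Factor) is how many copies of each bucket are maintained across indexers, typically RF=2 or 3.
Exceeding the daily license quota generates a license warning, and data continues to be indexed. On enforced licenses, accumulating enough warnings within a rolling window (classically 5 in 30 days for Enterprise) puts the deployment in violation, which blocks searches against non-internal indexes and shows a warning banner. The daily quota counter resets at the license day rollover (midnight on the license manager), but an active violation does not clear at that rollover; it ends when the warning count drops back under the limit or a reset key is applied. Exact thresholds and enforcement depend on the license type and Splunk version.
To reduce license usage, apply nullQueue transforms to drop noisy low-value logs (DEBUG, health checks) at the Heavy Forwarder, or wherever parsing happens, before they are indexed.
SmartStore is a Splunk feature that stores indexed data in object storage (S3/GCS/Azure Blob) once buckets roll out of hot, while indexers keep hot buckets and a local cache of recently searched data. This reduces indexer disk costs significantly.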
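Before writing a drop rule, it helps to estimate how much volume it would remove. The sketch below is a rough offline estimator, not Splunk configuration: it applies a candidate regex to a sample of raw log lines and reports the fraction of bytes that would match. The sample lines and the pattern are invented for illustration, and actual license metering may differ from a simple byte count.

```python
import re


def estimate_dropped_fraction(lines, pattern):
    """Return the fraction of sample bytes whose line matches the drop pattern."""
    regex = re.compile(pattern)
    total = sum(len(line.encode("utf-8")) for line in lines)
    dropped = sum(len(line.encode("utf-8")) for line in lines if regex.search(line))
    return dropped / total if total else 0.0


# Invented sample; in practice, read a representative slice of the real log file.
sample = [
    "2024-01-01T00:00:00Z DEBUG cache warm",
    "2024-01-01T00:00:01Z INFO GET /healthz 200",
    "2024-01-01T00:00:02Z ERROR payment timeout",
]

fraction = estimate_dropped_fraction(sample, r"\bDEBUG\b|GET /healthz")
print(f"{fraction:.0%} of sampled bytes would be dropped")
```

The same regex can then be carried into the transform that routes matching events to nullQueue, after checking it against events that must be kept.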
Scenario-based
1. Search _internal for forwarder connections: index=_internal tcpin_connections hostname=affected-host
2. SSH to the host and verify the UF is running: systemctl status SplunkForwarder
3. Check active inputs: splunk list monitor
4. Verify network connectivity to the indexer on port 9997.
5. Check splunkd.log for errors.
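Step 4 of the checklist above can be scripted when a tool such as nc or telnet is not available on the host. This is a minimal sketch using the Python standard library; the function name is hypothetical, and it only tests TCP reachability, not TLS settings or whether a Splunk receiver is actually listening on the port.

```python
import socket


def can_connect(host: str, port: int = 9997, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


# Example (hypothetical indexer name):
# can_connect("indexer01.example.com", 9997)
```

A False result points at firewalls, routing, or a stopped receiver; a True result moves the investigation on to outputs.conf and splunkd.log.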
Error count per service (single value, color-coded), error rate over time (timechart), p95 latency per service, top error messages, and top affected users. All filterable by environment token.
Scheduled alert every 5 minutes using SPL searching for 10+ failed logins per user, new admin account creation, or access outside business hours. Trigger once per run, throttle per user for 1 hour, route to PagerDuty via Splunk add-on.
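The "throttle per user for 1 hour" part is easy to misdescribe, so here is a simplified model of it in Python. It is an illustration of the idea, not Splunk's implementation: the function name and the sample trigger times are made up, and the model assumes suppression is keyed on the user field and measured from the last notification actually sent.

```python
def apply_throttle(triggers, window_seconds=3600):
    """Given (epoch_seconds, user) triggers sorted by time, return those that notify."""
    last_sent = {}
    notified = []
    for ts, user in triggers:
        previous = last_sent.get(user)
        if previous is None or ts - previous >= window_seconds:
            notified.append((ts, user))
            last_sent[user] = ts  # the window restarts only when a notification is sent
    return notified


# Made-up trigger times (seconds) for two users.
triggers = [(0, "alice"), (300, "alice"), (300, "bob"), (3000, "bob"), (3600, "alice")]
print(apply_throttle(triggers))
# alice at 300 and bob at 3000 are suppressed; bob's first alert is not affected by alice's
```

Throttling per field value is what keeps one noisy account from hiding a second account that starts failing a few minutes later.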
Query incident tickets / change management data, correlate incident start time with Splunk alert firing time, compute average time from incident start to alert. Show before/after comparison with historical data.
1. Dedicated index with team-specific retention policy.
2. Universal Forwarder deployment via Deployment Server.
3. Splunk role with access to their index only.
4. Sourcetype and extraction configuration.
5. Starter dashboard with error rate, volume, and top error messages.
Mock Practical Round
- Write SPL to find all services with error rate above 2% in the last 30 minutes.
- Explain how you would detect a data ingestion gap (no events for 15 minutes from a specific host).
- Build a dashboard panel showing top 5 error messages with drill-through to raw events.
- Configure an alert for payment service response time p95 > 2 seconds — fire once, throttle 10 minutes.
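For the ingestion-gap exercise, the decision logic is worth stating precisely before writing any SPL. The sketch below models it in Python under stated assumptions: the last-seen timestamps would come from a search such as stats latest(_time) by host, the host names are invented, and a list of expected hosts is supplied so that a host with no events at all in the search window is still flagged.

```python
def hosts_with_gap(expected_hosts, last_seen, now, threshold_seconds=15 * 60):
    """Return expected hosts with no event in the last threshold_seconds."""
    silent = []
    for host in expected_hosts:
        ts = last_seen.get(host)
        # A host missing from last_seen sent nothing in the search window at all.
        if ts is None or now - ts > threshold_seconds:
            silent.append(host)
    return silent


now = 1_700_000_000  # fixed epoch time so the example is reproducible
last_seen = {
    "web-01": now - 60,        # 1 minute ago: healthy
    "web-02": now - 20 * 60,   # 20 minutes ago: gap
    "db-01": now - 15 * 60,    # exactly 15 minutes: not yet over the threshold
}
print(hosts_with_gap(["web-01", "web-02", "db-01", "cache-01"], last_seen, now))
# → ['web-02', 'cache-01']
```

The expected-host list is the detail interviewers tend to probe: a search that only aggregates existing events cannot report a host that produced none.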
Key SPL Cheatsheet
# Error count by service
index=prod_app level=ERROR earliest=-1h | stats count by service | sort -count

# Error rate %
index=prod_app earliest=-1h | stats count(eval(level="ERROR")) AS e, count AS t by service | eval rate=round(e/t*100,2)

# p95 latency by service
index=prod_app | stats perc95(duration_ms) AS p95 by service | sort -p95

# Top failed logins (brute force)
index=auth result=failure earliest=-10m | stats count by user | where count>=10

# License volume by sourcetype
index=_internal source=*license_usage* | stats sum(b) AS bytes by st | sort -bytes

# Forwarder connections (a host missing from the results is not connected)
index=_internal group=tcpin_connections | stats count by hostname | sort -count

# Timechart errors per service per 5 minutes
index=prod_app level=ERROR | timechart span=5m count by service
Summary
Strong Splunk interview performance comes from demonstrating that you think operationally: you know how logs flow, how to find answers in SPL, and how to maintain Splunk in a production environment. Theory is table stakes; practical scenarios are what separate strong candidates.