
Interview Preparation

Targeted preparation for Splunk interviews in DevOps, SRE, SOC Analyst, and Platform Engineering roles.

Simple Explanation (ELI5)

Splunk interviews test whether you understand logs, can write SPL to find answers, and know how to set up and maintain Splunk in production. The goal is to demonstrate practical thinking, not just to recite memorized features.

What Interviewers Test

Core Revision Topics

Architecture

UF, HF, Indexer, Search Head, Deployment Server. When to use each component.

Ingestion

inputs.conf, HEC, syslog. Sourcetype, index routing, pipeline stages.

SPL Essentials

stats, timechart, eval, rex, where, top, dedup, lookup, transaction, streamstats.

Dashboards

Simple XML, tokens, drilldowns, saved searches, scheduled reports.

Alerts

Scheduled vs real-time, trigger conditions, throttling, alert actions.

Troubleshooting

_internal index, btool, forwarder diagnostics, license_usage, Job Inspector.

Rapid-fire Questions

Fundamentals (Beginner)

What makes structured logging better than unstructured?

Structured logs (JSON/KV) have named, typed fields that tools extract automatically — no regex parsing required, making searching and dashboarding faster and more reliable.
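
A minimal Python sketch of the difference, assuming a hypothetical payment event (the field names and the log wording are made up for illustration). The structured line round-trips with its types intact; the unstructured line needs a regex that is tied to the exact wording and returns only strings.

```python
import json
import re

# Hypothetical event used only for illustration.
event = {"service": "payments", "level": "ERROR", "duration_ms": 1840}

# Structured: one JSON object per line; fields keep their names and types.
structured_line = json.dumps(event)

# Unstructured: the same facts embedded in free text.
unstructured_line = "ERROR payments call took 1840 ms"

# Reading the structured line back needs no pattern matching.
parsed = json.loads(structured_line)

# Reading the unstructured line needs a regex that breaks if the wording changes.
pattern = re.compile(
    r"^(?P<level>\w+) (?P<service>\w+) call took (?P<duration_ms>\d+) ms$"
)
match = pattern.match(unstructured_line)
extracted = match.groupdict() if match else {}
```

Note that `parsed["duration_ms"]` is still an integer, while `extracted["duration_ms"]` is a string that has to be converted before any numeric comparison.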

What is the Splunk indexing pipeline?

The sequence: input (host, source, and sourcetype metadata assigned) → parsing (line breaking, timestamp extraction) → index-time transforms (routing, filtering, masking) → indexing into buckets. Most field extraction happens later, at search time.

What is a bucket in Splunk?

A time-bounded directory on the indexer containing compressed raw event data and index structures. Moves: hot → warm → cold → frozen (archived or deleted).

What is a sourcetype?

A label that identifies the data format — used by Splunk to apply the correct parsing rules, timestamp extraction, and field extractions.

What port does the Universal Forwarder send data on?

TCP port 9997 by convention (Splunk-to-Splunk forwarding). The receiving port is enabled on the indexer in inputs.conf, and the forwarder's target is set in outputs.conf.

What is the difference between an index and a sourcetype?

Index is the storage repository (like a database). Sourcetype is the format label that controls how data is parsed, not where it's stored.

What is HEC?

HTTP Event Collector — allows applications to POST events directly to Splunk via HTTPS with a token, without needing a forwarder agent.
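
A hedged sketch of what such a POST looks like, using only the Python standard library. The host name and token are placeholders, and `build_hec_request` is a hypothetical helper; the `/services/collector/event` path, the `Authorization: Splunk <token>` header, and the default port 8088 are standard HEC conventions. The request is built but not sent, so the sketch runs offline.

```python
import json
import urllib.request


def build_hec_request(base_url, token, event, sourcetype="_json", index=None):
    """Build (but do not send) a POST to the HEC event endpoint."""
    payload = {"event": event, "sourcetype": sourcetype}
    if index is not None:
        payload["index"] = index
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/services/collector/event",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Splunk " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token; 8088 is the default HEC port.
req = build_hec_request(
    "https://splunk.example.com:8088",
    "00000000-0000-0000-0000-000000000000",
    {"service": "payments", "level": "ERROR", "message": "card declined"},
    index="prod_app",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

In an interview, being able to name the endpoint, the token header, and the JSON envelope (`event`, `sourcetype`, `index`) is usually enough.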

What does the Deployment Server do?

Centrally manages and distributes apps and configuration updates to Universal Forwarder fleets.

SPL (Intermediate)

Write an SPL query to count errors per service in the last hour.

index=prod_app level=ERROR earliest=-1h | stats count by service | sort - count

How do you compute an error rate percentage in SPL?

| stats count(eval(level="ERROR")) AS errors, count AS total by service | eval error_pct=round(errors/total*100,2)
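
To sanity-check what that pipeline computes, here is the same arithmetic as a small Python sketch. The function name and sample events are hypothetical; the logic mirrors the conditional count, the total count, and the rounding step.

```python
from collections import defaultdict


def error_rate_by_service(events):
    """Return {service: error percentage}, rounded to 2 decimals."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        totals[e["service"]] += 1          # count AS total
        if e["level"] == "ERROR":
            errors[e["service"]] += 1      # count(eval(level="ERROR")) AS errors
    return {s: round(errors[s] / totals[s] * 100, 2) for s in totals}


sample = [
    {"service": "cart", "level": "ERROR"},
    {"service": "cart", "level": "INFO"},
    {"service": "cart", "level": "INFO"},
    {"service": "auth", "level": "INFO"},
]
rates = error_rate_by_service(sample)
```

A service with no errors still appears with a rate of 0.0, because the total is counted over all events rather than only the error events.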

What is the difference between stats and eventstats?

stats aggregates and collapses events into summary rows. eventstats computes statistics but adds results back to each original event — preserving all events for downstream commands.
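
The contrast is easy to see outside Splunk. A minimal Python sketch with made-up events, imitating the two behaviors rather than reproducing Splunk's implementation:

```python
from collections import Counter

events = [
    {"service": "cart", "level": "ERROR"},
    {"service": "cart", "level": "INFO"},
    {"service": "auth", "level": "ERROR"},
]

counts = Counter(e["service"] for e in events)

# stats-like: one summary row per service; the original events are gone.
stats_rows = [{"service": s, "count": c} for s, c in counts.items()]

# eventstats-like: every original event survives, with the aggregate attached.
eventstats_rows = [{**e, "count": counts[e["service"]]} for e in events]
```

Three events go in; the stats-like result has two rows, while the eventstats-like result still has three, each carrying its service's count alongside the original fields.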

When would you use transaction vs stats?

transaction groups related events into one (sessions, request-response correlations with variable boundaries). stats is faster for fixed aggregations. Prefer stats for performance.
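
When the grouping key is known (a session or request ID), the stats approach amounts to a single pass that tracks the first and last timestamp per key, roughly what `stats min(_time) AS start, max(_time) AS end by session_id` followed by an `eval` does. A Python sketch with hypothetical events:

```python
from collections import defaultdict

events = [
    {"session_id": "a", "_time": 100},
    {"session_id": "b", "_time": 105},
    {"session_id": "a", "_time": 130},
    {"session_id": "b", "_time": 106},
]

# Track [earliest, latest] timestamp per session in one pass.
spans = defaultdict(lambda: [float("inf"), float("-inf")])
for e in events:
    span = spans[e["session_id"]]
    span[0] = min(span[0], e["_time"])
    span[1] = max(span[1], e["_time"])

# Duration per session = latest - earliest.
durations = {sid: end - start for sid, (start, end) in spans.items()}
```

No events need to be held in memory or stitched together, which is the intuition behind preferring stats when a shared key exists.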

What does lookup do?

Enriches events by joining additional fields from a CSV lookup table or KV store based on a matching field value — e.g., adding team owner based on hostname.
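
Conceptually this is a keyed join. A small Python sketch, assuming a hypothetical two-column lookup table (`host`, `team`) and made-up events:

```python
import csv
import io

# Hypothetical lookup table: hostname -> owning team.
lookup_csv = "host,team\nweb-01,storefront\ndb-01,platform\n"
owners = {
    row["host"]: row["team"] for row in csv.DictReader(io.StringIO(lookup_csv))
}

events = [
    {"host": "web-01", "level": "ERROR"},
    {"host": "cache-01", "level": "WARN"},
]

# Matching events gain a team field; unmatched events pass through unchanged.
enriched = [
    {**e, "team": owners[e["host"]]} if e["host"] in owners else dict(e)
    for e in events
]
```

The unmatched host keeps flowing through the pipeline without the extra field, which is worth mentioning if asked what happens when a lookup has no matching row.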

What is the difference between earliest and _time?

earliest is a search time modifier that limits which indexed events are scanned. _time is the event's extracted timestamp field, searchable after retrieval.

How do you find the most recent event per host?

index=prod_app | stats latest(_time) AS last_seen, latest(_raw) AS last_event by host | sort - last_seen

Architecture & Operations (Intermediate)

Heavy Forwarder vs Universal Forwarder — when do you need HF?

When you need to parse, filter, mask PII, or conditionally route events before indexing. UF sends raw data; HF processes it first.

What is an indexer cluster and what is RF?

Multiple indexers managed by a Cluster Manager for HA. RF (Replication Factor) is how many copies of each bucket are maintained across indexers — typically RF=2 or 3.

What happens when you exceed the Splunk daily license limit?

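Indexing continues, and a license warning is recorded for that day. Search is blocked only when enough warnings accumulate within the rolling window to put the license into violation (historically 5 warnings in 30 days for an Enterprise license; thresholds vary by license type and version), and some newer license types do not block search at all. The daily volume counter resets at midnight, but accumulated warnings do not.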

How do you reduce Splunk ingest volume without losing important data?

Route noisy, low-value events (DEBUG lines, health checks) to nullQueue with props.conf and transforms.conf rules on the Heavy Forwarder or indexer, so they are dropped at parse time and never count against the license.

What is SmartStore?

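A Splunk indexer feature that keeps the master copy of warm buckets in remote object storage (S3, GCS, or Azure Blob). Hot buckets stay on local disk, and a cache manager keeps recently searched buckets local. This decouples storage from compute and can reduce indexer disk costs significantly.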

Scenario-based

There are no logs for 2 hours from a specific server. Walk me through your investigation.

  1. Search _internal for forwarder connections: index=_internal tcpin_connections hostname=affected-host.
  2. SSH to the host and verify the UF is running: systemctl status SplunkForwarder.
  3. Check splunk list monitor for active inputs.
  4. Verify network connectivity to the indexer on port 9997.
  5. Check splunkd.log for errors.
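
The network check in that walkthrough can be scripted when tools like nc or telnet are not installed on the host. A minimal Python sketch; `can_reach` is a hypothetical helper, and 9997 is only the conventional receiving port, so substitute whatever outputs.conf actually points at:

```python
import socket


def can_reach(host, port=9997, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

A True result only proves the TCP path is open. It says nothing about TLS settings or whether the indexer is accepting data, so splunkd.log on the forwarder is still the next stop.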

You need to build an on-call dashboard. What metrics do you include?

Error count per service (single value, color-coded), error rate over time (timechart), p95 latency per service, top error messages, and top affected users. All filterable by environment token.

Explain how you'd set up alerting for a security breach detection use case.

Scheduled alert every 5 minutes using SPL searching for 10+ failed logins per user, new admin account creation, or access outside business hours. Trigger once per run, throttle per user for 1 hour, route to PagerDuty via Splunk add-on.
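
Per-user throttling is the part candidates most often explain vaguely. A small Python sketch of the suppression logic only, not Splunk's implementation; the function name and sample data are hypothetical, and timestamps are in seconds:

```python
def throttle(alerts, window_s=3600):
    """Return the alerts that fire, suppressing repeats per user within window_s."""
    last_fired = {}
    fired = []
    for ts, user in sorted(alerts):
        if user not in last_fired or ts - last_fired[user] >= window_s:
            fired.append((ts, user))
            last_fired[user] = ts
    return fired


# alice trips the search three times; the second falls inside her 1-hour window.
triggers = [(0, "alice"), (300, "alice"), (600, "bob"), (3900, "alice")]
fired = throttle(triggers)
```

Throttling on the user field means a noisy account cannot page the on-call engineer every 5 minutes, while a different user tripping the same search still gets through.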

How would you prove to management that Splunk has reduced incident MTTD?

Query incident tickets / change management data, correlate incident start time with Splunk alert firing time, compute average time from incident start to alert. Show before/after comparison with historical data.
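
The arithmetic behind that comparison is simple enough to sketch. The records below are invented for illustration, and the field names `start_ts` and `alert_ts` (epoch seconds) are assumptions about how the ticket data might be shaped:

```python
from statistics import mean


def mttd_minutes(incidents):
    """Mean minutes between incident start and first alert (epoch seconds in)."""
    return mean((i["alert_ts"] - i["start_ts"]) / 60 for i in incidents)


# Invented sample data: two incidents before the rollout, two after.
before = [{"start_ts": 0, "alert_ts": 2400}, {"start_ts": 0, "alert_ts": 1200}]
after = [{"start_ts": 0, "alert_ts": 300}, {"start_ts": 0, "alert_ts": 600}]

improvement = mttd_minutes(before) - mttd_minutes(after)
```

With real data, the hard part is agreeing on what counts as the incident start time; state that assumption explicitly when presenting the numbers.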

A new team wants to use Splunk. What do you set up for them?

  1. Dedicated index with a team-specific retention policy.
  2. Universal Forwarder deployment via Deployment Server.
  3. Splunk role with access to their index only.
  4. Sourcetype and extraction configuration.
  5. Starter dashboard with error rate, volume, and top error messages.

Mock Practical Round

  1. Write SPL to find all services with error rate above 2% in the last 30 minutes.
  2. Explain how you would detect a data ingestion gap (no events for 15 minutes from a specific host).
  3. Build a dashboard panel showing top 5 error messages with drill-through to raw events.
  4. Configure an alert for payment service response time p95 > 2 seconds — fire once, throttle 10 minutes.
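
For task 2, it helps to state the underlying check before writing any SPL: take the most recent event time per host and compare it with the current time. A Python sketch of that logic, with a hypothetical function name and made-up timestamps in epoch seconds:

```python
def silent_hosts(last_seen, now, max_gap_s=900):
    """Return hosts whose newest event is older than max_gap_s seconds."""
    return sorted(h for h, ts in last_seen.items() if now - ts > max_gap_s)


# web-01 reported 200 s ago, db-01 reported 1100 s ago.
gaps = silent_hosts({"web-01": 1000, "db-01": 100}, now=1200)
```

One caveat worth raising in the interview: a host that has never sent data has no last-seen time at all, so it has to come from an inventory list or lookup rather than from the events themselves.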

Key SPL Cheatsheet

SPL quick reference
# Error count by service
index=prod_app level=ERROR earliest=-1h | stats count by service | sort -count

# Error rate %
| stats count(eval(level="ERROR")) AS e, count AS t by service | eval rate=round(e/t*100,2)

# p95 latency by service
index=prod_app | stats perc95(duration_ms) AS p95 by service | sort -p95

# Top failed logins (brute force)
index=auth result=failure earliest=-10m | stats count by user | where count>=10

# License volume by sourcetype
index=_internal source=*license_usage.log* type=Usage | stats sum(b) AS bytes by st | sort -bytes

# Forwarders not connected in the last 15 minutes
index=_internal group=tcpin_connections earliest=-24h | stats latest(_time) AS last_seen by hostname | where last_seen < relative_time(now(), "-15m")

# Timechart errors per service per 5 minutes
index=prod_app level=ERROR | timechart span=5m count by service

Summary

Strong Splunk interview performance comes from demonstrating that you think operationally: you know how logs flow, how to find answers in SPL, and how to maintain Splunk in a production environment. Theory is table stakes; practical scenarios are what separate strong candidates from the rest.