Hands-on · Lesson 8 of 9

Troubleshooting

Diagnose and fix the most common Splunk problems — no data indexed, forwarder failures, license warnings, slow searches, and field extraction issues.

Simple Explanation (ELI5)

When something goes wrong in Splunk, there's always a trace. Splunk logs about itself (internal indexes), forwarders have their own logs, and indexers track every piece of data they process. Troubleshooting in Splunk means following the data trail from source to search result.

Troubleshooting Framework

Approach every Splunk problem by tracing the data path from source to search: App/Server → Forwarder → Network → Indexer Pipeline → Index → Search Head → Results. Identify at which stage data is missing or incorrect.
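The stage-by-stage idea can be sketched in a few lines of Python. This is an illustrative sketch only: the stage names come from the data path above, while `first_failing_stage` and the boolean check results are hypothetical stand-ins for whatever manual checks you run at each stage.

```python
# Stages of the data path, in the order data flows through them.
STAGES = [
    "App/Server",
    "Forwarder",
    "Network",
    "Indexer Pipeline",
    "Index",
    "Search Head",
]


def first_failing_stage(checks):
    """Return the first stage whose check failed, or None if all passed.

    `checks` maps a stage name to True (healthy) or False (problem found).
    Stages missing from `checks` are treated as not yet verified and are
    reported as the place to look next.
    """
    for stage in STAGES:
        if not checks.get(stage, False):
            return stage
    return None


# Hypothetical results: the app writes logs and the forwarder runs,
# but the connection to the indexer could not be confirmed.
print(first_failing_stage({"App/Server": True, "Forwarder": True, "Network": False}))
```

Working strictly in data-flow order keeps you from debugging searches when the data never left the source host.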

Issue 1: No Data in Search Results

spl — Diagnose missing data
# Step 1: Verify data exists in the index (without filters)
index=prod_app | head 5 | table _time host sourcetype _raw

# Step 2: Check index list — is the index correct?
| rest /services/data/indexes | table title currentDBSizeMB totalEventCount

# Step 3: Check forwarder connection status from search head
index=_internal sourcetype=splunkd source=*metrics.log*
component=Metrics group=tcpin_connections
| stats count by hostname, sourceIp

# Step 4: Check indexing throughput per host (the host name is in the series field)
index=_internal source=*metrics.log* group=per_host_thruput
| timechart span=5m avg(kbps) by series

# Step 5: Verify forwarded (cooked) data is being received at the indexer
index=_internal source=*metrics.log* group=tcpin_connections connectionType=cooked
| stats count by host
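Once you know when each forwarder last connected, spotting the silent ones is a simple comparison. The sketch below assumes you have exported a hostname-to-last-seen mapping (for example from a `stats latest(_time) by hostname` search over the tcpin_connections events). The function name `silent_forwarders` and the 15-minute threshold are illustrative assumptions, not Splunk defaults.

```python
import time


def silent_forwarders(last_seen, now=None, max_age_seconds=900):
    """Return hostnames whose last connection is older than max_age_seconds.

    `last_seen` maps hostname -> epoch seconds of the most recent
    tcpin_connections event seen for that forwarder.
    """
    if now is None:
        now = time.time()
    return sorted(
        host for host, ts in last_seen.items() if now - ts > max_age_seconds
    )


# Hypothetical data: web01 reported 2 minutes ago, db01 two hours ago.
reference_time = 1_700_000_000
print(
    silent_forwarders(
        {"web01": reference_time - 120, "db01": reference_time - 7200},
        now=reference_time,
    )
)
```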

Issue 2: Universal Forwarder Not Sending Data

cli — Forwarder diagnostics
# On the forwarder host — check monitored files
$SPLUNK_HOME/bin/splunk list monitor

# Check forwarder connection status
$SPLUNK_HOME/bin/splunk list forward-server

# Validate outputs.conf is correct
$SPLUNK_HOME/bin/splunk btool outputs list --debug

# Check inputs.conf is parsing correctly
$SPLUNK_HOME/bin/splunk btool inputs list --debug

# Restart the forwarder
$SPLUNK_HOME/bin/splunk restart

# Check forwarder log for errors
tail -f $SPLUNK_HOME/var/log/splunk/splunkd.log | grep -i error

Issue 3: License Warning / Quota Exceeded

spl — License investigation
# Find top data contributors (by sourcetype)
index=_internal source=*license_usage.log* type=Usage earliest=-1d
| stats sum(b) AS bytes by st
| eval MB=round(bytes/1024/1024,2)
| sort - MB
| head 20
| rename st AS sourcetype

# Check daily ingestion trend
index=_internal source=*license_usage.log* type=Usage
| timechart span=1d sum(b) AS daily_bytes
| eval GB=round(daily_bytes/1024/1024/1024,3)

# Identify which host is driving volume
index=_internal source=*license_usage.log* type=Usage earliest=-1d
| stats sum(b) AS bytes by h
| eval MB=round(bytes/1024/1024,2)
| sort - MB | head 10
| rename h AS host
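The arithmetic behind these searches is a sum per key followed by a bytes-to-megabytes conversion. The Python sketch below mirrors that logic on exported rows. The function name `top_contributors` and the sample values are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict


def top_contributors(rows, limit=20):
    """Sum bytes per key and return (key, MB) pairs, largest first.

    `rows` is an iterable of (key, bytes) pairs, for example the `st`
    (sourcetype) and `b` (bytes) fields from license usage events.
    """
    totals = defaultdict(int)
    for key, num_bytes in rows:
        totals[key] += num_bytes
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # Same conversion as the SPL: bytes / 1024 / 1024, rounded to 2 places.
    return [(key, round(total / 1024 / 1024, 2)) for key, total in ranked[:limit]]


# Hypothetical sample values, not real measurements.
sample = [
    ("app_debug", 3 * 1024 * 1024),
    ("nginx_access", 1024 * 1024),
    ("app_debug", 1024 * 1024),
]
print(top_contributors(sample))
```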

Issue 4: Slow Search Performance

spl — Performance diagnosis
# Check job inspector — click search ID in Activity → Jobs
# Key metrics: Scan count, Event count, Run time

# Use tstats for queries on indexed fields (often much faster, since it reads index metadata instead of raw events)
| tstats count WHERE index=prod_app by sourcetype, host

# Check scheduler activity: many scheduled searches in the same window compete for search slots
index=_internal sourcetype=scheduler
| timechart span=5m count AS scheduled_search_runs

# Find long-running scheduled searches
# (status values differ between Splunk versions, so match both)
index=_internal sourcetype=scheduler (status=success OR status=completed)
| stats avg(run_time) AS avg_runtime by savedsearch_name
| sort - avg_runtime | head 10

# Check which hosts log the most search operator messages (a rough proxy for search load)
index=_internal sourcetype=splunkd component=SearchOperator*
| stats count by host
| sort - count
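The Job Inspector's scan count and event count are easiest to interpret as a ratio. A minimal sketch, assuming you read both numbers off the Job Inspector by hand; `scan_ratio` is a hypothetical helper, not a Splunk function.

```python
def scan_ratio(scan_count, event_count):
    """Return how many events were scanned per event returned.

    A ratio close to 1 means the search read little more than it needed.
    A large ratio suggests filtering happens late and the base search
    could be narrowed (index, sourcetype, time range, indexed terms).
    """
    if event_count == 0:
        # Nothing matched, so any scanned event was wasted work.
        return float("inf") if scan_count > 0 else 0.0
    return scan_count / event_count


# Hypothetical Job Inspector numbers.
print(scan_ratio(scan_count=2_000_000, event_count=500))
```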

Issue 5: Field Not Extracted

spl + conf — Field extraction troubleshooting
# Step 1: Verify the raw event has the expected field
index=prod_app | head 5 | table _raw

# Step 2: Test rex extraction inline
index=prod_app
| rex field=_raw "duration=(?P<dur_ms>\d+)"
| table _time dur_ms _raw

# Step 3: Check what sourcetype is being applied
index=prod_app | head 10 | table sourcetype _raw

# If sourcetype is wrong — fix in inputs.conf:
[monitor:///var/log/app/app.log]
sourcetype = my_custom_app
index = prod_app

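# In props.conf, define the extraction and timestamp format.
# EXTRACT applies at search time; TIME_FORMAT and MAX_TIMESTAMP_LOOKAHEAD apply at index time.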
[my_custom_app]
EXTRACT-duration = duration=(?P<duration_ms>\d+)
TIME_FORMAT = %Y-%m-%dT%H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 25

# Validate btool shows the new extraction
$SPLUNK_HOME/bin/splunk btool props list my_custom_app --debug
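Before deploying the stanza, it can help to try the regex and time format against a real log line outside Splunk. The sketch below uses Python's `re` and `datetime.strptime`, which behave the same as Splunk for patterns this simple but are not identical engines in general. The sample line and the helper `check_event` are hypothetical, and the code assumes the timestamp occupies the first 19 characters of the event.

```python
import re
from datetime import datetime

# Same pattern and format as the props.conf example above.
EXTRACT_PATTERN = re.compile(r"duration=(?P<duration_ms>\d+)")
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def check_event(raw):
    """Return (timestamp, duration_ms) parsed from one raw event line.

    Either element is None when the corresponding pattern does not match.
    Assumes the timestamp is the first 19 characters of the line.
    """
    try:
        timestamp = datetime.strptime(raw[:19], TIME_FORMAT)
    except ValueError:
        timestamp = None
    match = EXTRACT_PATTERN.search(raw)
    duration = int(match.group("duration_ms")) if match else None
    return timestamp, duration


# Hypothetical log line; substitute a real line taken from `table _raw`.
sample = "2024-01-15T10:30:45 INFO request done duration=142 status=200"
print(check_event(sample))
```

If either value comes back as None for a real event, the stanza would most likely fail in Splunk for the same reason.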

Issue 6: Indexer Not Receiving Data

spl + cli — Indexer connectivity
# Check if indexer listener is active (on indexer host)
$SPLUNK_HOME/bin/splunk list inputstatus

# Check network connectivity to indexer port 9997
telnet indexer.company.com 9997

# Check indexer receiving log
index=_internal source=*splunkd.log* host=indexer01 "Message from"
| head 20 | table _time source _raw

# Check for parsing errors on indexer
index=_internal source=*splunkd.log* log_level=ERROR host=indexer01
| head 20 | table _time _raw
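On hosts where telnet is not installed, the same reachability test can be done with Python's standard library. This is a minimal sketch: `can_connect` is a hypothetical helper, and a successful TCP connection only shows the port is reachable, not that Splunk is accepting forwarded data on it.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Run from the forwarder host, `can_connect("indexer.company.com", 9997)` answers the same question as the telnet command above.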

Troubleshooting Checklist

Symptom               First Check                  SPL/Command
No results in search  Is the index correct?        index=X | head 5
Forwarder silent      outputs.conf / connectivity  splunk list forward-server
License warning       Top sourcetype volume        index=_internal source=*license_usage*
Search very slow      Job Inspector scan count     Click job ID → Job Inspector
Field missing         sourcetype assignment        index=X | head 1 | table sourcetype _raw
Wrong timestamps      TIME_FORMAT in props.conf    splunk btool props list sourcetype --debug

Debugging Scenarios

Real-world Use Case

A production team reported "no logs for the last 2 hours." The SRE used index=_internal source=*metrics.log* group=tcpin_connections to verify the specific forwarder was not connecting. They then checked splunk list forward-server on the host and found the outputs.conf was pointing to a decommissioned indexer IP. They updated outputs.conf, restarted the forwarder, and logs began flowing again within 60 seconds. Total resolution time: 8 minutes.

Interview Questions

Beginner

Where does Splunk store its own operational logs?

Locally in log files under $SPLUNK_HOME/var/log/splunk/ (splunkd.log, metrics.log, and others), which are also indexed into the _internal index so they can be searched with SPL.

What is the _internal index used for?

It stores Splunk's own operational metrics and events — forwarder connections, indexing throughput, license usage, search job activity.

How do you check if a forwarder is connected?

Run splunk list forward-server on the forwarder, or search index=_internal sourcetype=splunkd group=tcpin_connections on the indexer.

What does btool do?

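A Splunk CLI tool that displays the merged configuration as it exists on disk across all conf files, showing which settings apply to a given stanza. With --debug it also shows the file each setting comes from, and splunk btool check reports invalid stanzas and settings.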

What do you check first when a search returns no results?

Verify the index name is correct, the time range is appropriate, and data is actually indexed with index=X | head 5.

Intermediate

How do you investigate a license warning?

Query index=_internal source=*license_usage.log* type=Usage | stats sum(b) by st to find which sourcetypes are consuming the most license volume.

How do you diagnose a slow search?

Use the Job Inspector to compare scan count vs event count. High ratio means poor index selectivity — add index/sourcetype filters and narrow the time range.

Data comes in but timestamps are wrong. How do you fix it?

Add TIME_FORMAT and TIME_PREFIX to props.conf for the sourcetype. Validate with btool. May need to re-index affected data if historical records have wrong timestamps.

Field extraction was working, then stopped. What likely changed?

The app changed its log format — the regex no longer matches. Run | head 5 | table _raw to inspect current format and update the EXTRACT regex accordingly.

How do you validate conf file changes without restarting Splunk?

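Use splunk btool props list <sourcetype> --debug to verify the merged on-disk config. Note that btool does not show what the running process has loaded. Search-time settings such as EXTRACT generally apply to new searches without a restart, while index-time settings such as TIME_FORMAT or line breaking require a restart of the indexer or heavy forwarder and affect only newly indexed data.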

Scenario-based

No logs from a specific server for the last 30 minutes. Your diagnosis steps?

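1. Search index=_internal source=*metrics.log* group=tcpin_connections | stats latest(_time) AS last_seen by hostname to see whether the forwarder is still connecting.
2. SSH to the server and check that the UF is running and the log file is still being written.
3. Run splunk list monitor on the forwarder to confirm the file is monitored.
4. Test network connectivity to the indexer on port 9997.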

License is at 100%. You need to reduce volume immediately. How?

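Find the top sourcetypes by volume. Route DEBUG events from the top contributor to nullQueue with a props.conf/transforms.conf rule on the heavy forwarder (or on the indexers if there is no heavy forwarder). Events dropped before indexing do not count against the license, so this reduces volume without losing the service's INFO, WARN, and ERROR logs.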
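Before dropping anything, it is worth estimating how much volume the rule would actually remove. The Python sketch below only mimics the filtering decision of a regex-based drop rule on a sample of log lines; it does not configure Splunk. The function name `estimate_drop_savings`, the `\bDEBUG\b` pattern, and the sample lines are all hypothetical.

```python
import re


def estimate_drop_savings(lines, pattern=r"\bDEBUG\b"):
    """Return (bytes_total, bytes_dropped) if matching lines were discarded.

    Sizes are measured as UTF-8 bytes per line, which approximates but
    does not exactly reproduce how Splunk meters license volume.
    """
    regex = re.compile(pattern)
    total = dropped = 0
    for line in lines:
        size = len(line.encode("utf-8"))
        total += size
        if regex.search(line):
            dropped += size
    return total, dropped


# Hypothetical sample lines.
sample = [
    "2024-01-15T10:30:45 DEBUG cache miss key=abc",
    "2024-01-15T10:30:46 ERROR payment failed id=42",
    "2024-01-15T10:30:47 DEBUG cache hit key=abc",
]
total, dropped = estimate_drop_savings(sample)
print(f"{dropped / total:.0%} of bytes would be dropped")
```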

Search returns events but they all show the same timestamp. Why?

The timestamp isn't being extracted from the event properly, so Splunk falls back to another time source, such as the file modification time or the time of indexing, and many events end up with the same timestamp. Fix TIME_FORMAT and TIME_PREFIX in props.conf for the sourcetype.

A critical dashboard stopped showing data for the last hour only. All other dashboards are fine. Why?

Check if the specific index used by this dashboard has data: index=X earliest=-2h | timechart span=10m count — find the gap. Then check forwarder connectivity for that index's sources.

A new app was deployed and logs are in a new format. How do you set up proper parsing?

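Set an explicit sourcetype in inputs.conf, add timestamp extraction (TIME_PREFIX, TIME_FORMAT) to props.conf, prototype the field regex inline with rex, then save it as an EXTRACT in props.conf (or as a REPORT that references a transforms.conf stanza), validate with btool, and deploy the configuration app via Deployment Server.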

Summary

Systematic troubleshooting in Splunk follows the data path: source → forwarder → indexer → search. The _internal index is your best friend — it contains metrics, connection status, and error events for every Splunk component. Combine btool for config validation with SPL queries against _internal for operational diagnosis.