Data Ingestion
Learn how logs flow from applications and servers into Splunk indexes using forwarders, HEC, and syslog.
Simple Explanation (ELI5)
Data ingestion is the delivery system that gets your logs into Splunk. Think of it as a postal service: your applications write letters (logs), and the forwarder is the postal worker that picks them up, drives them to the sorting facility (indexer), and places them in the right mailbox (index).
Technical Explanation
Splunk ingests data through multiple input methods. The most common for server logs is the Universal Forwarder monitoring file paths. For cloud-native and microservices workloads, the HTTP Event Collector (HEC) is preferred — apps POST JSON events directly over HTTPS. Syslog inputs, scripted inputs, and Splunk Add-ons (TAs) handle specialized sources.
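Once data arrives at the indexer, the pipeline processes it in order: line breaking → timestamp extraction → source type recognition → index-time field extraction → indexing into buckets. Most field extraction happens later, at search time; only fields configured for index time are extracted at this point.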
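To make the first stages concrete, here is a toy Python sketch of line breaking and timestamp extraction. It is an illustration only, not Splunk's implementation: the sample log text, the rule that an event starts at a line beginning with a timestamp, the UTC assumption, and all function names are assumptions made for this example.

```python
import re
from datetime import datetime, timezone

# Hypothetical raw chunk: two events, the first spanning two lines.
RAW = (
    "2025-04-21 10:25:22 ERROR Payment timeout\n"
    "  at PaymentService.charge\n"
    "2025-04-21 10:25:23 INFO Retry scheduled\n"
)

# Assumed rule: a new event starts at any line that begins with a timestamp.
EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", re.MULTILINE)


def break_lines(raw):
    """Split a raw chunk into events at each timestamp-leading line."""
    starts = [m.start() for m in EVENT_START.finditer(raw)]
    ends = starts[1:] + [len(raw)]
    return [raw[s:e].rstrip("\n") for s, e in zip(starts, ends)]


def extract_time(event):
    """Parse the leading 19 characters as a UTC timestamp, return epoch seconds."""
    stamp = datetime.strptime(event[:19], "%Y-%m-%d %H:%M:%S")
    return int(stamp.replace(tzinfo=timezone.utc).timestamp())


def parse(raw, sourcetype, index):
    """Attach time, sourcetype, and index metadata to each event."""
    return [
        {"_time": extract_time(e), "sourcetype": sourcetype, "index": index, "_raw": e}
        for e in break_lines(raw)
    ]


events = parse(RAW, "app_logs", "prod_app")
for event in events:
    print(event["_time"], repr(event["_raw"]))
```

The stack-trace line stays attached to the first event because it does not start with a timestamp, which is the same idea behind configuring line breaking per sourcetype.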
Ingestion Methods
Universal Forwarder: Monitors file paths and sends raw data. Configured via inputs.conf. Most common for server and application logs.
Heavy Forwarder: A full Splunk engine that parses, filters, masks sensitive data, and routes before indexing. Used for complex pipelines.
HTTP Event Collector (HEC): Apps send JSON events via HTTPS POST to a Splunk endpoint. No agent needed. Ideal for containers and serverless.
Syslog: Network devices and operating systems send syslog (UDP/TCP 514) to a Splunk syslog input or a syslog-ng/rsyslog intermediary.
Scripted inputs: Custom scripts run on a schedule to pull data from APIs or databases. Splunk indexes whatever the script writes to stdout.
Splunk Add-ons (TAs): Pre-built data connectors for AWS, Azure, Office365, and more. They handle authentication and field normalization automatically.
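As a rough illustration of a scripted input, the sketch below prints one key=value event per record to stdout, which is the stream Splunk reads from a scripted input. The `collect_metrics` function and its data are hypothetical stand-ins for a real API or database query.

```python
import sys
import time


def collect_metrics():
    """Hypothetical stand-in for an API or database query."""
    return [
        {"queue": "payments", "depth": 12},
        {"queue": "emails", "depth": 3},
    ]


def format_event(record, now):
    """Render one record as a timestamped key=value line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    pairs = " ".join(f"{key}={value}" for key, value in record.items())
    return f"{stamp} {pairs}"


def main(out=sys.stdout):
    now = time.time()
    for record in collect_metrics():
        # Each line written to stdout becomes candidate event data.
        out.write(format_event(record, now) + "\n")


if __name__ == "__main__":
    main()
```

A script like this would typically be referenced from a script:// stanza in inputs.conf with an interval setting that controls how often it runs.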
Hands-on: Forwarder Configuration
# inputs.conf

# Monitor a single log file
[monitor:///var/log/app/application.log]
index = prod_app
sourcetype = app_logs
disabled = false

# Monitor a directory recursively
[monitor:///var/log/nginx/...]
index = prod_web
sourcetype = nginx_access
disabled = false

# Monitor Windows Event Log
[WinEventLog://Application]
index = windows_events
sourcetype = WinEventLog:Application
disabled = false
# outputs.conf

[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = indexer1.company.com:9997, indexer2.company.com:9997
useSSL = true
sslCertPath = $SPLUNK_HOME/etc/certs/forwarder.pem
sslRootCAPath = $SPLUNK_HOME/etc/auth/cacert.pem
# Send single event to HEC
curl -k https://splunk.company.com:8088/services/collector/event \
-H "Authorization: Splunk <HEC_TOKEN>" \
-H "Content-Type: application/json" \
-d '{
"time": 1745231122,
"host": "app-server-01",
"source": "payment-service",
"sourcetype": "payment_logs",
"index": "prod_app",
"event": {
"level": "ERROR",
"message": "Payment timeout",
"user_id": "u-4419",
"duration_ms": 5023
}
}'
# Send batch of events
curl -k https://splunk.company.com:8088/services/collector/event \
-H "Authorization: Splunk <HEC_TOKEN>" \
-d '{"event":"first event"}{"event":"second event"}'
Indexing Pipeline
[Diagram: data arrives → line breaking, timestamp, source type → masking, routing, field extraction → buckets: Hot → Warm → Cold → Frozen]
Source Types and Field Extraction
# props.conf: tell Splunk which sourcetype to apply transforms to
[app_logs]
TRANSFORMS-extract = extract_app_fields

# transforms.conf: define the regex extraction
[extract_app_fields]
REGEX = level=(?P<log_level>\w+)\s+msg="(?P<message>[^"]+)"\s+user=(?P<user_id>\S+)
FORMAT = log_level::$1 message::$2 user_id::$3
WRITE_META = true
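Before deploying an extraction, it can help to test the regex against a sample line. The sketch below runs the same pattern through Python's re module, which accepts the same named-group syntax; the sample log line is an assumption made up for this example.

```python
import re

# Same pattern as the REGEX setting in transforms.conf.
PATTERN = re.compile(
    r'level=(?P<log_level>\w+)\s+msg="(?P<message>[^"]+)"\s+user=(?P<user_id>\S+)'
)

# Hypothetical sample event for the app_logs sourcetype.
SAMPLE = 'level=ERROR msg="Payment timeout" user=u-4419'


def extract_fields(raw):
    """Return the named groups as a dict, or an empty dict if nothing matches."""
    match = PATTERN.search(raw)
    return match.groupdict() if match else {}


fields = extract_fields(SAMPLE)
print(fields)
```

A line that does not match yields no fields at all, which is a quick way to spot events the pattern would silently skip.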
Debugging Scenarios
- Forwarder not sending data: Run ./splunk list monitor on the forwarder to confirm inputs are active. Check that outputs.conf points to the correct indexer IP and port 9997.
- Data appearing in wrong index: Check the index setting in the inputs.conf stanza; an index set at the forwarder level overrides the default. If events are reaching the wrong indexer group, check _TCP_ROUTING, which is also set in inputs.conf.
- Wrong sourcetype assigned: Add an explicit sourcetype in the inputs.conf monitor stanza instead of relying on auto-detection.
- HEC returning 403: The token is invalid or disabled. Check or regenerate the HEC token in Settings → Data Inputs → HTTP Event Collector.
- Timestamps parsed incorrectly: Add TIME_FORMAT and TIME_PREFIX to props.conf for the affected sourcetype.
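For the timestamp case, the following sketch imitates the idea behind TIME_PREFIX and TIME_FORMAT: find a regex prefix, then parse what follows with a strptime-style format. It is a simplified model rather than Splunk's parser; the sample event, the `ts=` prefix, and the `lookahead` parameter (loosely analogous to MAX_TIMESTAMP_LOOKAHEAD) are assumptions for illustration.

```python
import re
from datetime import datetime


def extract_timestamp(raw, time_prefix, time_format, lookahead=32):
    """Parse the timestamp that follows the first match of time_prefix."""
    match = re.search(time_prefix, raw)
    if match is None:
        return None
    candidate = raw[match.end():match.end() + lookahead]
    # strptime needs an exact-length string, so try the longest slice first.
    for end in range(len(candidate), 0, -1):
        try:
            return datetime.strptime(candidate[:end], time_format)
        except ValueError:
            continue
    return None


# Hypothetical event where the timestamp is not at the start of the line.
RAW = 'req_id=77 ts=2025-04-21T10:25:22 level=ERROR msg="Payment timeout"'
parsed = extract_timestamp(RAW, r"ts=", "%Y-%m-%dT%H:%M:%S")
print(parsed)
```

Trying a candidate format against real sample events this way is cheaper than re-indexing data after a wrong guess.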
Real-world Use Case
A Kubernetes-based e-commerce platform shipped logs using HEC from all microservice containers. No agents on pods, no file mounting — each container POSTed structured JSON to the HEC endpoint via an environment variable token injected by the platform team. Index routing was handled by the index field in the JSON payload, allowing each team to own their own Splunk index with separate RBAC.
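A container-side sender along these lines might look like the sketch below. It only builds the HTTPS request so it can run offline; the environment variable names SPLUNK_HEC_URL and SPLUNK_HEC_TOKEN and the helper functions are hypothetical, while the payload fields mirror the curl example earlier on this page.

```python
import json
import os
import time
import urllib.request


def build_event(event, sourcetype, index, host, when=None):
    """Wrap an application event in HEC metadata; index drives routing."""
    return {
        "time": when if when is not None else time.time(),
        "host": host,
        "sourcetype": sourcetype,
        "index": index,
        "event": event,
    }


def build_request(url, token, events):
    """Build one POST carrying all events as concatenated JSON objects."""
    body = "".join(json.dumps(e) for e in events).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical variable names; the platform team would inject the real values.
HEC_URL = os.environ.get(
    "SPLUNK_HEC_URL", "https://splunk.company.com:8088/services/collector/event"
)
HEC_TOKEN = os.environ.get("SPLUNK_HEC_TOKEN", "<HEC_TOKEN>")

request = build_request(HEC_URL, HEC_TOKEN, [
    build_event(
        {"level": "ERROR", "message": "Payment timeout"},
        sourcetype="payment_logs",
        index="prod_app",
        host="app-server-01",
        when=1745231122,
    ),
])
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```

Keeping the token in the environment rather than in the image means it can be rotated without rebuilding containers.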
Interview Questions
Beginner
A Splunk configuration file that defines data inputs — file monitors, network ports, Windows event logs — and their target indexes.
HTTP Event Collector — allows apps to POST events directly to Splunk over HTTPS without a forwarder agent.
A label in Splunk that identifies the format and structure of incoming data, used to apply parsing rules and field extractions.
Port 9997 (default) for Splunk-to-Splunk forwarding over TCP.
The named repository where ingested events are stored. An input selects it with the index setting in its inputs.conf stanza; the index itself must be created on the indexer.
Intermediate
Use _TCP_ROUTING in inputs.conf or a Heavy Forwarder with transforms.conf routing stanzas based on source or field values.
The sequence of processing steps: input → parsing (line-breaking, timestamping) → transforms (masking, routing) → indexing into buckets.
When you need to parse, mask PII, route events conditionally, or filter events before indexing — heavy forwarder does pre-processing to reduce indexer load.
A time-based directory containing compressed raw data and index files. Moves through states: hot (active write) → warm → cold → frozen (archived/deleted).
Use SEDCMD in props.conf, or a REGEX-based transform in transforms.conf that rewrites _raw (DEST_KEY = _raw) with the masked value, on the Heavy Forwarder before data reaches the indexer.
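The masking answer can be prototyped outside Splunk. SEDCMD takes a sed-style s/regex/replacement/flags expression, and the Python sketch below applies the same kind of substitution with re.sub. The 16-digit card pattern, the sample event, and the function name are assumptions for illustration.

```python
import re

# Assumed pattern: a 16-digit number; keep only the last four digits.
CARD = re.compile(r"\b\d{12}(\d{4})\b")


def mask_cards(raw):
    """Replace the first 12 digits of each 16-digit number with X."""
    return CARD.sub(r"XXXXXXXXXXXX\1", raw)


# Hypothetical event containing a card number.
masked = mask_cards('level=INFO msg="charged" card=4111111111111111 user=u-4419')
print(masked)
```

Checking a pattern like this against sample events first helps avoid masking too much or too little once it is deployed.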
Scenario-based
Use HEC — inject the endpoint URL and token as environment variables. The app POSTs structured JSON events directly to Splunk over HTTPS.
1. Verify outputs.conf has correct indexer address/port. 2. Check if the file path in inputs.conf exists. 3. Check network/firewall on port 9997. 4. Review $SPLUNK_HOME/var/log/splunk/splunkd.log for errors.
Search index=_internal source=*metrics.log* group=per_sourcetype_thruput to see ingest by sourcetype and identify the spike source.
Deploy Universal Forwarders to all 500 servers via the Deployment Server. Configure WinEventLog inputs via a server-class app pushed from the deployment server.
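Use a transform in transforms.conf whose REGEX matches DEBUG-level events and which sets DEST_KEY = queue and FORMAT = nullQueue, referenced from a TRANSFORMS setting in props.conf, to drop those events at the Heavy Forwarder before they reach the indexer.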
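The null-queue filter in the last answer amounts to a regex test per event: matches are discarded, everything else continues to indexing. Here is a small Python model of that decision; the `level=DEBUG` pattern and the sample events are assumptions, and the two lists merely stand in for Splunk's index and null queues.

```python
import re

# Assumed pattern for events that should be dropped.
DEBUG_EVENT = re.compile(r"\blevel=DEBUG\b")


def split_queues(events):
    """Return (index_queue, null_queue); matching events go to the null queue."""
    index_queue, null_queue = [], []
    for raw in events:
        target = null_queue if DEBUG_EVENT.search(raw) else index_queue
        target.append(raw)
    return index_queue, null_queue


# Hypothetical events in the key=value style used earlier on this page.
EVENTS = [
    'level=DEBUG msg="cache miss" user=u-1',
    'level=ERROR msg="Payment timeout" user=u-4419',
    'level=INFO msg="order placed" user=u-2',
]

kept, dropped = split_queues(EVENTS)
print(len(kept), "kept,", len(dropped), "dropped")
```

Events dropped this way never reach an index, so they cannot be recovered later by a search.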
Summary
Data ingestion is the foundation of Splunk value. The right method — Universal Forwarder for servers, HEC for ephemeral/cloud workloads, syslog for network devices — combined with correct sourcetype, index routing, and field extraction determines search quality for everything downstream.