Beginner · Lesson 3 of 9

Data Ingestion

Learn how logs flow from applications and servers into Splunk indexes using forwarders, HEC, and syslog.

Simple Explanation (ELI5)

Data ingestion is the delivery system that gets your logs into Splunk. Think of it as a postal service: your applications write letters (logs), and the forwarder is the postal worker that picks them up, drives them to the sorting facility (indexer), and places them in the right mailbox (index).

Technical Explanation

Splunk ingests data through multiple input methods. The most common for server logs is the Universal Forwarder monitoring file paths. For cloud-native and microservices workloads, the HTTP Event Collector (HEC) is preferred — apps POST JSON events directly over HTTPS. Syslog inputs, scripted inputs, and Splunk Add-ons (TAs) handle specialized sources.

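Once data arrives at the indexer, the pipeline processes it: line breaking → timestamp extraction → source type recognition → optional index-time field extraction → indexing into buckets. Most field extraction happens later, at search time, so index-time extraction is the exception rather than the rule.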
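The parsing stages above are controlled per sourcetype in props.conf. The following is a minimal sketch, assuming a hypothetical single-line log format in which every event starts with a timestamp such as 2025-04-21 10:25:22,123. The app_logs sourcetype name and the timestamp layout are assumptions for illustration, not values from a real deployment.

```ini
# props.conf (on the indexer or heavy forwarder, where parsing happens)
[app_logs]
# Treat every newline as an event boundary and skip the slower line-merging pass
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
# The timestamp sits at the very start of each event
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S,%3N
# Only scan the first 23 characters when looking for the timestamp
MAX_TIMESTAMP_LOOKAHEAD = 23
# Cap event length in bytes so a runaway line cannot create a huge event
TRUNCATE = 10000
```

Setting these explicitly, rather than relying on automatic detection, tends to make parsing both faster and more predictable.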

Ingestion Methods

Universal Forwarder

Monitors file paths and sends raw data. Configured via inputs.conf. Most common for server and application logs.

Heavy Forwarder

Full Splunk engine — parses, filters, masks sensitive data, and routes before indexing. Used for complex pipelines.

HTTP Event Collector (HEC)

Apps send JSON events via HTTPS POST to a Splunk endpoint. No agent needed. Ideal for containers and serverless.

Syslog

Network devices and OS send syslog (UDP/TCP 514) to a Splunk syslog input or syslog-ng/rsyslog intermediary.

Scripted Input

Custom scripts run on a schedule to pull data from APIs or databases. Splunk captures whatever the script writes to stdout and indexes it.

Splunk Add-ons (TAs)

Pre-built data connectors for AWS, Azure, Office365, and more — handle authentication and field normalization automatically.
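As a rough illustration of the Syslog and Scripted Input methods listed above, both are declared as stanzas in inputs.conf. This is a sketch under assumptions: the network index, the api_poll sourcetype, and the script path are hypothetical names. Comments sit on their own lines because Splunk .conf files do not support trailing comments.

```ini
# inputs.conf

# Listen for syslog directly on UDP 514
# (binding to ports below 1024 requires root privileges on Linux)
[udp://514]
sourcetype = syslog
# Hypothetical index name
index = network
# Set the host field from the sender's IP address
connection_host = ip

# Run a script every 300 seconds and index whatever it prints to stdout
[script://$SPLUNK_HOME/etc/apps/my_app/bin/poll_api.sh]
interval = 300
# Hypothetical sourcetype name
sourcetype = api_poll
index = prod_app
disabled = false
```

In larger environments a syslog-ng or rsyslog intermediary that writes to files, monitored by a forwarder, is often preferred over a direct network input, because it keeps receiving data while Splunk restarts.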

Hands-on: Forwarder Configuration

inputs.conf (Universal Forwarder)
# Monitor a single log file
[monitor:///var/log/app/application.log]
index = prod_app
sourcetype = app_logs
disabled = false

# Monitor a directory recursively
[monitor:///var/log/nginx/...]
index = prod_web
sourcetype = nginx_access
disabled = false

# Monitor Windows Event Log
[WinEventLog://Application]
index = windows_events
sourcetype = WinEventLog:Application
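disabled = false

outputs.conf (Forwarder → Indexer)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# Listing several indexers enables automatic load balancing across them
server = indexer1.company.com:9997, indexer2.company.com:9997
useSSL = true
# clientCert replaces the older sslCertPath setting, which is deprecated in current Splunk versions
clientCert = $SPLUNK_HOME/etc/certs/forwarder.pem
# Recent versions read the CA path from server.conf [sslConfig]; it is kept here for older forwarders
sslRootCAPath = $SPLUNK_HOME/etc/auth/cacert.pem

HEC: Sending events via curl
# Send single event to HEC
# -k skips TLS certificate verification; use it only when testing against self-signed certificates
curl -k https://splunk.company.com:8088/services/collector/event \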
  -H "Authorization: Splunk <HEC_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "time": 1745231122,
    "host": "app-server-01",
    "source": "payment-service",
    "sourcetype": "payment_logs",
    "index": "prod_app",
    "event": {
      "level": "ERROR",
      "message": "Payment timeout",
      "user_id": "u-4419",
      "duration_ms": 5023
    }
  }'

# Send batch of events
curl -k https://splunk.company.com:8088/services/collector/event \
  -H "Authorization: Splunk <HEC_TOKEN>" \
  -d '{"event":"first event"}{"event":"second event"}'

Indexing Pipeline

Raw data arrives → Parsing (line breaking, timestamp, source type) → Transforms (masking, routing, field extraction) → Buckets (Hot → Warm → Cold → Frozen)

Source Types and Field Extraction

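props.conf and transforms.conf: Custom field extraction
# props.conf: tell Splunk which sourcetype to apply transforms to
[app_logs]
# TRANSFORMS-<name> runs at index time, on the indexer or a heavy forwarder
TRANSFORMS-extract = extract_app_fields

# transforms.conf: define the regex extraction
[extract_app_fields]
REGEX = level=(?P<log_level>\w+)\s+msg="(?P<message>[^"]+)"\s+user=(?P<user_id>\S+)
# $1..$3 refer to the capture groups in order
FORMAT = log_level::$1 message::$2 user_id::$3
# WRITE_META stores the fields in the index; each one also needs INDEXED = true in fields.conf
WRITE_META = true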

Real-world Use Case

A Kubernetes-based e-commerce platform shipped logs from all microservice containers using HEC. With no agents on pods and no file mounting, each container POSTed structured JSON to the HEC endpoint, authenticating with a token that the platform team injected as an environment variable. Index routing was handled by the index field in the JSON payload, so each team could own its own Splunk index with separate RBAC.
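On the Splunk side, a setup like this could be backed by one HEC token stanza per team. The sketch below is an assumption about how such a token might be declared. The stanza name payment_service is hypothetical, and the index and sourcetype values are borrowed from the curl example earlier in this lesson.

```ini
# inputs.conf (on the Splunk instance that hosts HEC)

# Global HEC settings
[http]
disabled = 0
port = 8088

# One token per team or service
[http://payment_service]
disabled = 0
token = <HEC_TOKEN>
# Default index when the payload does not set one
index = prod_app
# Indexes this token is allowed to write to
indexes = prod_app
sourcetype = payment_logs
```

Restricting each token with the indexes setting is what keeps one team's payload from writing into another team's index, even though the index is chosen in the JSON body.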

Interview Questions

Beginner

What is inputs.conf?

A Splunk configuration file that defines data inputs — file monitors, network ports, Windows event logs — and their target indexes.

What is HEC?

HTTP Event Collector — allows apps to POST events directly to Splunk over HTTPS without a forwarder agent.

What is a sourcetype?

A label in Splunk that identifies the format and structure of incoming data, used to apply parsing rules and field extractions.

What port does the Universal Forwarder use to send data?

Port 9997 by convention, for Splunk-to-Splunk forwarding over TCP. The actual port is whichever receiving port has been enabled on the indexer.

What is an index in the context of ingestion?

The named repository where ingested events are stored. It is assigned with the index setting inside an inputs.conf stanza, and the index itself must already exist on the indexer (defined in indexes.conf).
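For reference, an index is created on the indexer with a stanza in indexes.conf. A minimal sketch for the prod_app index used in this lesson, assuming the default $SPLUNK_DB storage location:

```ini
# indexes.conf (on the indexer)
[prod_app]
# Hot and warm buckets
homePath = $SPLUNK_DB/prod_app/db
# Cold buckets
coldPath = $SPLUNK_DB/prod_app/colddb
# Buckets restored from an archive
thawedPath = $SPLUNK_DB/prod_app/thaweddb
```

If a forwarder sends events to an index name that does not exist on the indexer, those events are not stored where expected, so the index should be created before the input is enabled.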

Intermediate

How do you route events from one forwarder to different indexes?

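Set a different index value in each inputs.conf stanza, or use a Heavy Forwarder with props.conf and transforms.conf stanzas that rewrite the index (DEST_KEY = _MetaData:Index) based on source or event content. Note that _TCP_ROUTING in inputs.conf selects the output group (which indexers receive the data), not the index.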
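The transforms-based routing mentioned in this answer could look roughly like the sketch below. The match pattern service=payment and the target index payments are hypothetical, and the target index would need to exist on the indexer.

```ini
# props.conf (heavy forwarder or indexer)
[app_logs]
TRANSFORMS-route_index = route_payment_to_index

# transforms.conf
[route_payment_to_index]
# Match events whose raw text mentions the payment service (hypothetical pattern)
REGEX = service=payment
# Overwrite the event's index metadata
DEST_KEY = _MetaData:Index
FORMAT = payments
```

Events that do not match the regex keep the index assigned by their inputs.conf stanza.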

What is the indexing pipeline?

The sequence of processing steps: input → parsing (line-breaking, timestamping) → transforms (masking, routing) → indexing into buckets.

When would you use a Heavy Forwarder instead of Universal?

When you need to parse, mask PII, route events conditionally, or filter events before indexing — heavy forwarder does pre-processing to reduce indexer load.

What is a bucket in Splunk?

A time-based directory containing compressed raw data and index files. Moves through states: hot (active write) → warm → cold → frozen (archived/deleted).
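How long buckets live before they are frozen is set per index in indexes.conf. This is a sketch with illustrative values only: the 90-day retention, the size cap, and the archive path are assumptions, not recommendations.

```ini
# indexes.conf (on the indexer)
[prod_app]
# Roll buckets to frozen once their newest event is older than 90 days (90 * 86400 seconds)
frozenTimePeriodInSecs = 7776000
# Also freeze the oldest buckets if the index grows past roughly 500 GB
maxTotalDataSizeMB = 512000
# Archive frozen buckets here instead of deleting them (hypothetical path)
coldToFrozenDir = /archive/splunk/prod_app
```

Without coldToFrozenDir or a freezing script, frozen buckets are deleted.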

How do you handle PII in log data before indexing?

Use SEDCMD in props.conf, or a TRANSFORMS stanza whose transforms.conf entry rewrites _raw (REGEX plus FORMAT with DEST_KEY = _raw), on the Heavy Forwarder before data reaches the indexer.
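Both masking approaches can be sketched as follows, assuming the sensitive value is a 16-digit card number and that only the last four digits should remain visible. The stanza and class names are hypothetical, and in practice only one of the two options would be used.

```ini
# props.conf (heavy forwarder)
[payment_logs]
# Option 1: sed-style replacement applied to every match in the event
SEDCMD-mask_card = s/\d{12}(\d{4})/XXXXXXXXXXXX\1/g
# Option 2: hand the event to a transform that rewrites _raw
TRANSFORMS-mask_card = mask_card_number

# transforms.conf
[mask_card_number]
# Capture everything before and after the first 12 digits of the number
REGEX = ^(.*?)\d{12}(\d{4}.*)$
FORMAT = $1XXXXXXXXXXXX$2
DEST_KEY = _raw
```

SEDCMD is usually the simpler choice. The transform version as written masks only the first card number in each event.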

Scenario-based

A new microservice needs to send logs. It runs in a container with no persistent storage. How do you ingest its logs?

Use HEC — inject the endpoint URL and token as environment variables. The app POSTs structured JSON events directly to Splunk over HTTPS.

Forwarder is running and inputs are configured but no events arrive. Checklist?

1. Verify outputs.conf has the correct indexer address and port.
2. Check that the file path in inputs.conf exists.
3. Check network/firewall rules on port 9997.
4. Review $SPLUNK_HOME/var/log/splunk/splunkd.log for errors.

Ingestion is 3× normal volume today. How do you trace the cause?

Search index=_internal source=*metrics.log* group=per_sourcetype_thruput to see ingest by sourcetype and identify the spike source.

You need to ingest Windows Event Logs from 500 servers. Approach?

Deploy Universal Forwarders to all 500 servers via the Deployment Server. Configure WinEventLog inputs via a server-class app pushed from the deployment server.
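On the deployment server, the server class described here might be declared as in the sketch below. The server class name windows_servers and the app name win_eventlog_inputs are hypothetical. The app itself would live under $SPLUNK_HOME/etc/deployment-apps/ and contain the inputs.conf with the WinEventLog stanzas.

```ini
# serverclass.conf (on the deployment server)
[serverClass:windows_servers]
# Match every deployment client, then narrow the class to 64-bit Windows hosts
whitelist.0 = *
machineTypesFilter = windows-x64

# Push the app that holds the WinEventLog inputs.conf
[serverClass:windows_servers:app:win_eventlog_inputs]
# Restart the forwarder after the app is installed or updated
restartSplunkd = true
stateOnClient = enabled
```

Each forwarder also needs a deploymentclient.conf that points at the deployment server before it can receive the app.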

A log file has both application logs and debug noise mixed in. How do you reduce index size?

Use a TRANSFORMS stanza whose transforms.conf entry matches DEBUG-level events with REGEX and sends them to the null queue (DEST_KEY = queue, FORMAT = nullQueue), dropping them at the Heavy Forwarder before they reach the indexer.
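A minimal sketch of that filter, assuming the events carry a level=DEBUG token as in the app_logs format used earlier in this lesson (the class and stanza names are hypothetical):

```ini
# props.conf (heavy forwarder)
[app_logs]
TRANSFORMS-drop_debug = drop_debug_events

# transforms.conf
[drop_debug_events]
# Any event containing this text is discarded
REGEX = level=DEBUG
DEST_KEY = queue
FORMAT = nullQueue
```

Events sent to the null queue are never indexed, so they do not count against license volume.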

Summary

Data ingestion is the foundation of Splunk value. The right method — Universal Forwarder for servers, HEC for ephemeral/cloud workloads, syslog for network devices — combined with correct sourcetype, index routing, and field extraction determines search quality for everything downstream.