Beginner | Lesson 1 of 9

Logging Fundamentals

Understand what logs are, why they matter, and how production systems emit, structure, and retain them.

Simple Explanation (ELI5)

Logs are like a diary your applications write automatically. Every time something happens — a user logs in, an error is thrown, a payment is processed — the application writes a note. Later, when something goes wrong, you read the diary to find out what happened.

Real-world Analogy

Think of a flight recorder (black box) on an aircraft. It records every pilot action, every system reading, and every anomaly in flight. When investigators need to understand a crash, they read the recorder. Logs are your application's black box.

Technical Explanation

A log is a time-stamped record of an event emitted by an application, operating system, network device, or service. Logs capture: what happened, when it happened, who triggered it, and what the outcome was. Systems emit logs continuously — production environments can generate millions of log lines per minute.

Two broad log types exist: unstructured (free-text, like Apache access logs) and structured (key-value or JSON, easier to query). Modern observability practice strongly favors structured logging because search tools like Splunk can extract fields automatically.
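As a minimal sketch of structured logging, the snippet below uses Python's standard logging module with a custom formatter that renders each record as one JSON object. The class name JsonFormatter and the convention of passing extra fields under extra={"fields": ...} are assumptions made for this illustration, not part of any particular library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one per line)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Extra fields arrive via extra={"fields": {...}} (a convention of this sketch).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.error(
    "Payment gateway timeout",
    extra={"fields": {"user_id": "u-4419", "duration_ms": 5023}},
)

# A search tool can now read fields by name instead of parsing free text.
parsed = json.loads(stream.getvalue())
print(parsed["level"], parsed["user_id"], parsed["duration_ms"])
```

Because every entry is valid JSON, downstream tools can filter on user_id or duration_ms without any regex.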

Log Levels

DEBUG

Detailed developer info. Noisy in production. Used in local/test environments.

INFO

Normal operational events. Service started, user logged in, job completed successfully.

WARN

Something unexpected but recoverable. Retry succeeded, deprecated API called.

ERROR

A failure that impacted a single operation. DB timeout, file not found, API 500.

FATAL / CRITICAL

Application or service is crashing or unusable. Needs immediate attention.

TRACE

Most granular level. Every function call, loop iteration. Typically disabled.
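To illustrate how a level threshold filters output, here is a small sketch using Python's standard logging module. Note that Python names the levels WARNING and CRITICAL and has no built-in TRACE level; the ListHandler class and the logger name are hypothetical helpers for this example.

```python
import logging


class ListHandler(logging.Handler):
    """Collect the level name of every record that passes the threshold."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record.levelname)


collector = ListHandler()
log = logging.getLogger("levels-demo")
log.propagate = False
log.addHandler(collector)
log.setLevel(logging.WARNING)  # a production-style threshold

log.debug("cache lookup took 3 ms")        # dropped
log.info("service started")                # dropped
log.warning("retry succeeded after 2 attempts")
log.error("DB timeout")
log.critical("out of memory, shutting down")

print(collector.records)  # ['WARNING', 'ERROR', 'CRITICAL']
```

Lowering the threshold to logging.DEBUG in a local or test environment would let all five messages through.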

Structured vs Unstructured Logs

log formats

# Unstructured (Apache access log)
192.168.1.10 - - [21/Apr/2026:10:45:22 +0000] "GET /api/health HTTP/1.1" 200 512

# Structured (JSON)
{
  "timestamp": "2026-04-21T10:45:22Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment gateway timeout",
  "user_id": "u-4419",
  "duration_ms": 5023,
  "trace_id": "abc-123"
}

# Key-value (syslog style)
Apr 21 10:45:22 app-server payment[1234]: level=ERROR msg="Gateway timeout" user=u-4419 duration=5023
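The difference in parsing effort between these formats can be sketched as follows. The regular expressions are illustrative assumptions written for the sample lines above, not a complete parser for every Apache or syslog variant.

```python
import json
import re

apache_line = '192.168.1.10 - - [21/Apr/2026:10:45:22 +0000] "GET /api/health HTTP/1.1" 200 512'
json_line = '{"level": "ERROR", "service": "payment-service", "duration_ms": 5023}'
kv_line = 'level=ERROR msg="Gateway timeout" user=u-4419 duration=5023'

# Unstructured: the field layout has to be described by hand.
APACHE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)
apache_fields = APACHE_RE.match(apache_line).groupdict()

# Structured JSON: one call, and numbers keep their type.
json_fields = json.loads(json_line)

# Key-value: a short generic pattern works, but every value is a string.
KV_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')
kv_fields = {key: value.strip('"') for key, value in KV_RE.findall(kv_line)}

print(apache_fields["status"], json_fields["duration_ms"], kv_fields["msg"])
```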

Log Lifecycle

Application emits log → Forwarder/Agent collects → Transport (TCP/TLS/HEC) → Indexer stores & indexes
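The lifecycle above can be mimicked in a single process as a rough analogy. In this sketch a queue plays the role of the transport, Python's QueueListener plays the forwarder, and a hypothetical InMemoryIndexer class stands in for a real indexer; an actual deployment would use a separate agent and a network transport.

```python
import logging
import logging.handlers
import queue


class InMemoryIndexer(logging.Handler):
    """Toy indexer: stores messages and indexes them by level."""

    def __init__(self):
        super().__init__()
        self.by_level = {}

    def emit(self, record):
        self.by_level.setdefault(record.levelname, []).append(record.getMessage())


buffer = queue.Queue()  # stands in for the transport
indexer = InMemoryIndexer()
forwarder = logging.handlers.QueueListener(buffer, indexer)
forwarder.start()  # background thread that drains the queue

app_log = logging.getLogger("lifecycle-demo")
app_log.setLevel(logging.INFO)
app_log.propagate = False
app_log.addHandler(logging.handlers.QueueHandler(buffer))  # the application only enqueues

app_log.info("order %s created", "o-1")
app_log.error("payment failed for order %s", "o-1")

forwarder.stop()  # flushes remaining records before returning
print(indexer.by_level)
```

The point of the design is that the application never blocks on storage: it hands the record off and continues.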

Log Retention and Rotation

Retention policy: How long logs are kept (for example, 7 days hot, 90 days warm, 1 year cold/archive).
Log rotation: Files are split by size or time (e.g., daily rotation) to prevent disk fill.
Centralized logging: Ship logs off the host immediately; never rely on local disk alone.
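Size-based rotation can be sketched with RotatingFileHandler from Python's standard library. The tiny maxBytes value, the temporary directory, and the logger name are assumptions chosen so the rotation is visible in a short run; real limits are usually megabytes.

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

# Roll over before the file reaches 200 bytes; keep at most 3 old files.
handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=200, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

rot_log = logging.getLogger("rotation-demo")
rot_log.setLevel(logging.INFO)
rot_log.propagate = False
rot_log.addHandler(handler)

for i in range(50):
    rot_log.info("processed job %d", i)
handler.close()

files = sorted(os.listdir(log_dir))
sizes = [os.path.getsize(os.path.join(log_dir, name)) for name in files]
print(files, sizes)
```

Older files beyond backupCount are deleted, which is what bounds disk usage. For time-based rotation such as daily files, the standard library also provides TimedRotatingFileHandler.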

Debugging Scenarios

No logs at all: Check log level is not FATAL-only; verify logging framework is initialized.
Log files filling disk: Implement rotation and ship logs off-host before local files reach their size limit.
Mixed formats in same source: Different app versions are logging differently; enforce a log standard via a shared library.
Timestamps in wrong timezone: Always log in UTC; convert at display layer.
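For the timezone scenario, one common approach in Python is to point the formatter's converter at time.gmtime so timestamps are rendered in UTC regardless of the host's local timezone. The record below is constructed by hand with a fixed timestamp purely so the output is reproducible; the logger name and file name are placeholders.

```python
import calendar
import logging
import time

utc_formatter = logging.Formatter(
    fmt="%(asctime)sZ %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
)
utc_formatter.converter = time.gmtime  # render timestamps in UTC, not local time

# Hand-built record with a fixed creation time (2026-04-21 10:45:22 UTC).
record = logging.LogRecord(
    "tz-demo", logging.INFO, "demo.py", 1, "Payment gateway timeout", None, None
)
record.created = calendar.timegm((2026, 4, 21, 10, 45, 22))

line = utc_formatter.format(record)
print(line)  # 2026-04-21T10:45:22Z INFO Payment gateway timeout
```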

Real-world Use Case

A payment service failed silently for 15 minutes. Without structured logging there was no way to know how many transactions failed or which user IDs were affected. After migrating to JSON structured logging with trace IDs, the engineering team could identify all failed transactions within 30 seconds of an incident starting, using a single Splunk Search Processing Language (SPL) query.

Interview Questions

Beginner

What is a log?

A time-stamped record of an event emitted by an application, OS, or service.

What are the main log levels?

DEBUG, INFO, WARN, ERROR, FATAL (or CRITICAL), and sometimes TRACE.

Difference between structured and unstructured logs?

Structured logs (JSON/KV) have named fields that tools can parse automatically, and JSON also preserves value types. Unstructured logs are free text requiring regex extraction.

Why is centralized logging important?

Local logs can be lost when a pod is rescheduled or an instance is terminated. Centralized logging survives host failures and allows cross-service correlation.

What is log retention?

The policy defining how long logs are kept — balancing compliance requirements, cost, and diagnostic utility.

Intermediate

What is a trace ID and why is it used in logs?

A unique identifier shared across all log entries for a single request, enabling end-to-end tracing across microservices.
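One way to attach a trace ID to every log entry within a request is a context variable plus a logging filter, sketched below. The names trace_id_var, TraceIdFilter, handle_request, and the "shop" logger hierarchy are hypothetical; in a real system the ID would typically arrive in a request header from the upstream service.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so the formatter can print it."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


out = io.StringIO()
handler = logging.StreamHandler(out)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(name)s: %(message)s")
)
handler.addFilter(TraceIdFilter())

shop_log = logging.getLogger("shop")
shop_log.setLevel(logging.INFO)
shop_log.propagate = False
shop_log.addHandler(handler)


def handle_request(incoming_trace_id=None):
    # Reuse the caller's ID if one was passed along, otherwise start a new trace.
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    logging.getLogger("shop.gateway").info("request received")
    logging.getLogger("shop.payment").info("charge authorized")


handle_request("abc-123")
lines = out.getvalue().splitlines()
print("\n".join(lines))
```

Searching for trace_id=abc-123 then returns every entry for that request, whichever component wrote it.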

What is log sampling and when do you use it?

Writing only a fraction of DEBUG/INFO logs to reduce volume. Used when overall log volume is too high and INFO logs dominate cost.
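A sampling policy of this kind might look like the following sketch: a logging filter that always keeps WARNING and above and passes only a fraction of lower-severity records. SamplingFilter, CountingHandler, the 10% rate, and the fixed seed are assumptions for the illustration.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every WARNING-or-higher record; keep roughly `rate` of the rest."""

    def __init__(self, rate, seed=None):
        super().__init__()
        self.rate = rate
        self._rng = random.Random(seed)  # seeded here only to make the demo repeatable

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return self._rng.random() < self.rate


class CountingHandler(logging.Handler):
    """Count how many records of each level survive filtering."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def emit(self, record):
        self.counts[record.levelname] = self.counts.get(record.levelname, 0) + 1


counter = CountingHandler()
counter.addFilter(SamplingFilter(rate=0.1, seed=42))

sample_log = logging.getLogger("sampling-demo")
sample_log.setLevel(logging.DEBUG)
sample_log.propagate = False
sample_log.addHandler(counter)

for i in range(1000):
    sample_log.info("heartbeat %d", i)
for i in range(10):
    sample_log.error("gateway timeout %d", i)

print(counter.counts)
```

Exempting errors from sampling matters: a sampler that drops ERROR entries can hide exactly the events an investigation needs.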

What is the difference between logging and monitoring?

Logging records discrete events as text; monitoring tracks numeric metrics over time. They are complementary — metrics detect anomalies, logs explain them.

What is log aggregation?

Collecting logs from many sources into a single platform (Splunk, ELK, Loki) for unified search and analysis.

Why should logs never contain secrets?

Logs are shared with many teams and stored persistently. Secrets in logs become a security vulnerability — use masking or scrubbing before emission.
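Masking before emission can be sketched as a logging filter that rewrites the message. The key names in SECRET_RE (password, token, api_key) and the RedactingFilter class are assumptions for this example; a real deployment would need a pattern list matched to its own secret formats.

```python
import logging
import re

# Matches key=value pairs whose key looks sensitive (illustrative list only).
SECRET_RE = re.compile(r"(?i)\b(password|token|api_key)=(\S+)")


class RedactingFilter(logging.Filter):
    """Replace the value of sensitive key=value pairs with *** before emission."""

    def filter(self, record):
        message = record.getMessage()  # apply %-formatting first
        record.msg = SECRET_RE.sub(r"\1=***", message)
        record.args = None  # message is already fully formatted
        return True


# Hand-built record so the effect can be checked without any handler.
record = logging.LogRecord(
    "auth", logging.WARNING, "demo.py", 1,
    "login failed user=%s password=%s", ("u-4419", "hunter2"), None,
)
RedactingFilter().filter(record)

print(record.getMessage())  # login failed user=u-4419 password=***
```

Attaching the filter to a handler with handler.addFilter(RedactingFilter()) applies it to every record that handler emits.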

Scenario-based

An application crashed 2 hours ago. What do you look for first in logs?

Search for ERROR or FATAL entries in the window before the crash, sorted by timestamp, to find the root event.

Users report intermittent 500 errors but logs show mostly 200s. Why?

Log sampling may be dropping errors, the load balancer may be routing some traffic to an unhealthy node that is not logging, or log-level filters may be excluding some errors.

Log volume tripled overnight. What are likely causes?

Log level accidentally set to DEBUG in production, a new service logging too verbosely, or a process stuck in a loop and emitting the same error repeatedly.

How would you design logging for a microservices payment flow?

Structured JSON logs with a shared trace ID injected at API gateway, propagated through each service, with consistent field names (user_id, amount, status) for cross-service correlation.

Audit requires 7-year log retention. How do you architect this cost-effectively?

Hot storage for 30 days (fast search), warm for 90 days, then cold/archive storage (for example S3 Glacier, or object storage behind Splunk SmartStore) for the remainder, with clear restore procedures.
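The tiering arithmetic can be made concrete with a small sketch. It assumes the 90 warm days follow the 30 hot days (so warm covers days 30 to 119) and treats 7 years as 7 × 365 days; both are interpretations for illustration, and the function name storage_tier is hypothetical.

```python
from datetime import date, timedelta

HOT_DAYS = 30              # fast search
WARM_DAYS = 90             # assumed to follow the hot period
RETENTION_DAYS = 7 * 365   # audit requirement, ignoring leap days


def storage_tier(log_date: date, today: date) -> str:
    """Return the tier a log written on log_date belongs in as of today."""
    age = (today - log_date).days
    if age < HOT_DAYS:
        return "hot"
    if age < HOT_DAYS + WARM_DAYS:
        return "warm"
    if age < RETENTION_DAYS:
        return "cold"
    return "expired"


today = date(2026, 4, 21)
for age_days in (5, 45, 120, 400, RETENTION_DAYS):
    print(age_days, storage_tier(today - timedelta(days=age_days), today))
```

Under these assumptions a log spends roughly 95% of its retained life in the cold tier, which is why the price of that tier dominates the cost.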

Summary

Logs are the foundation of observability. Structured logging with consistent fields, centralized collection, and clear retention policies transform raw text into actionable operational intelligence.
