Logging Fundamentals
Understand what logs are, why they matter, and how production systems emit, structure, and retain them.
Simple Explanation (ELI5)
Logs are like a diary your applications write automatically. Every time something happens — a user logs in, an error is thrown, a payment is processed — the application writes a note. Later, when something goes wrong, you read the diary to find out what happened.
Real-world Analogy
Think of a flight recorder (black box) on an aircraft. It records every pilot action, every system reading, and every anomaly in flight. When investigators need to understand a crash, they read the recorder. Logs are your application's black box.
Technical Explanation
A log is a time-stamped record of an event emitted by an application, operating system, network device, or service. Logs capture: what happened, when it happened, who triggered it, and what the outcome was. Systems emit logs continuously — production environments can generate millions of log lines per minute.
Two broad log types exist: unstructured (free-text, like Apache access logs) and structured (key-value or JSON, easier to query). Modern observability practice strongly favors structured logging because search tools like Splunk can extract fields automatically.
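To make the structured option concrete, here is a minimal sketch of JSON logging using only Python's standard library. The `JsonFormatter` class, the `build_logger` helper, and the list of extra field names are illustrative assumptions, not part of any particular logging product.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Optional fields copied from the record when the caller supplies them
    # via `extra=`. These names are hypothetical and mirror the sample log.
    EXTRA_FIELDS = ("service", "user_id", "duration_ms", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for name in self.EXTRA_FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("payment-service")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger(sys.stdout)
logger.error(
    "Payment gateway timeout",
    extra={"service": "payment-service", "user_id": "u-4419", "duration_ms": 5023},
)
```

Because every entry is a JSON object with named fields, a search tool can filter on `user_id` or `duration_ms` directly instead of relying on regex extraction.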
Log Levels
- DEBUG: Detailed developer info. Noisy in production. Used in local/test environments.
- INFO: Normal operational events. Service started, user logged in, job completed successfully.
- WARN: Something unexpected but recoverable. Retry succeeded, deprecated API called.
- ERROR: A failure that impacted a single operation. DB timeout, file not found, API 500.
- FATAL: Application or service is crashing or unusable. Needs immediate attention.
- TRACE: Most granular level. Every function call, loop iteration. Typically disabled.
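The effect of a level threshold can be sketched with Python's standard `logging` module. Note that Python spells WARN as WARNING and FATAL as CRITICAL, and has no built-in TRACE level. The `emit_all_levels` helper and the sample messages are hypothetical.

```python
import io
import logging


def emit_all_levels(threshold: int) -> list[str]:
    """Log one message per level and return the level names that were written."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.levels.{threshold}")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    logger.setLevel(threshold)  # records below this level are discarded

    logger.debug("cache miss for key=cart:4419")
    logger.info("service started")
    logger.warning("retry succeeded after 2 attempts")
    logger.error("database timeout")
    logger.critical("cannot bind port, shutting down")
    return stream.getvalue().split()


print(emit_all_levels(logging.WARNING))  # ['WARNING', 'ERROR', 'CRITICAL']
print(emit_all_levels(logging.DEBUG))    # all five levels
```

A production service would typically run with a threshold of INFO or WARNING and lower it to DEBUG only while investigating a problem.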
Structured vs Unstructured Logs
# Unstructured (Apache access log)
192.168.1.10 - - [21/Apr/2026:10:45:22 +0000] "GET /api/health HTTP/1.1" 200 512
# Structured (JSON)
{
  "timestamp": "2026-04-21T10:45:22Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "Payment gateway timeout",
  "user_id": "u-4419",
  "duration_ms": 5023,
  "trace_id": "abc-123"
}
# Key-value (syslog style)
Apr 21 10:45:22 app-server payment[1234]: level=ERROR msg="Gateway timeout" user=u-4419 duration=5023
Log Lifecycle
Application emits log → Agent collects → Forwards over the network (TCP/TLS/HEC) → Central platform stores & indexes
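Part of what happens at the "stores & indexes" stage is field extraction. The sketch below is a rough illustration of that idea for the key-value sample line shown earlier. It is not how any specific platform parses logs, and `parse_key_values` is a hypothetical helper.

```python
import shlex


def parse_key_values(line: str) -> dict[str, str]:
    """Extract key=value pairs from a log line, honoring double-quoted values.

    Tokens without "=" (timestamp, host, process name) are ignored.
    shlex.split raises ValueError on unbalanced quotes, so a real pipeline
    would need to handle malformed lines.
    """
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep and key:
            fields[key] = value
    return fields


line = (
    "Apr 21 10:45:22 app-server payment[1234]: "
    'level=ERROR msg="Gateway timeout" user=u-4419 duration=5023'
)
print(parse_key_values(line))
# {'level': 'ERROR', 'msg': 'Gateway timeout', 'user': 'u-4419', 'duration': '5023'}
```

All extracted values are strings. Converting `duration` to a number is left to the consumer, which is one reason typed JSON fields are often easier to work with.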
Log Retention and Rotation
- Retention policy: How long logs are kept in each storage tier (for example, 7 days hot, 90 days warm, 1 year cold/archive).
- Log rotation: Files are split by size or time (e.g., daily rotation) to prevent disk fill.
- Centralized logging: Ship logs off the host immediately; never rely on local disk alone.
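Size-based rotation can be sketched with `RotatingFileHandler` from Python's standard library. The 1 KiB limit, the backup count, and the `write_with_rotation` helper are assumptions chosen to make the behavior visible in a small demo. Real limits would be far larger.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def write_with_rotation(log_dir: str, lines: int) -> list[str]:
    """Write `lines` messages through a rotating handler; return the files left on disk."""
    path = os.path.join(log_dir, "app.log")
    # Roll over before app.log exceeds 1024 bytes; keep at most 3 old files
    # (app.log.1 .. app.log.3). Older data is deleted, which caps disk usage.
    handler = RotatingFileHandler(path, maxBytes=1024, backupCount=3)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    logger = logging.getLogger("demo.rotation")
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    for i in range(lines):
        logger.info("processed job %d", i)

    logger.removeHandler(handler)
    handler.close()
    return sorted(os.listdir(log_dir))


with tempfile.TemporaryDirectory() as tmp:
    print(write_with_rotation(tmp, lines=200))
```

For time-based rotation, the standard library also provides `TimedRotatingFileHandler`, for example with `when="midnight"` for a daily split. Either way, rotation only bounds local disk usage. It does not replace shipping logs off the host.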
Debugging Scenarios
- No logs at all: Check log level is not FATAL-only; verify logging framework is initialized.
- Log files filling disk: Configure rotation with a size cap, and ship logs off-host before local files reach that cap.
- Mixed formats in same source: Different app versions are logging in different formats; enforce a single log standard through a shared logging library.
- Timestamps in wrong timezone: Always log in UTC; convert at display layer.
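For the timezone scenario, one way to force UTC timestamps in Python's standard `logging` module is to swap the formatter's time converter. This is a minimal sketch; `utc_formatter` and `format_at` are hypothetical helper names.

```python
import logging
import time


def utc_formatter() -> logging.Formatter:
    """Return a formatter that prints ISO 8601 timestamps in UTC."""
    formatter = logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%SZ",
    )
    # The default converter is time.localtime; gmtime makes the "Z" suffix truthful.
    formatter.converter = time.gmtime
    return formatter


def format_at(epoch_seconds: float, message: str) -> str:
    """Format a record as if it had been created at `epoch_seconds`."""
    record = logging.LogRecord("demo", logging.INFO, "demo.py", 0, message, None, None)
    record.created = epoch_seconds
    return utc_formatter().format(record)


print(format_at(0, "service started"))  # 1970-01-01T00:00:00Z INFO service started
```

With every service emitting UTC, entries from hosts in different regions sort correctly, and the display layer can convert to the viewer's local time.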
Real-world Use Case
A payment service failed silently for 15 minutes. Without structured logging there was no way to know how many transactions failed or which user IDs were affected. After migrating to JSON structured logging with trace IDs, the engineering team could identify all failed transactions within 30 seconds of an incident starting, using a single SPL query.
Interview Questions
Beginner
Q: What is a log?
A: A time-stamped record of an event emitted by an application, OS, or service.
Q: What are the common log levels?
A: DEBUG, INFO, WARN, ERROR, FATAL (or CRITICAL), and sometimes TRACE.
Q: What is the difference between structured and unstructured logs?
A: Structured logs (JSON/KV) have typed, named fields that tools can parse automatically. Unstructured logs are free text requiring regex extraction.
Q: Why centralize logs instead of keeping them on the local host?
A: Local logs are lost when a pod/instance restarts. Centralized logging survives host failures and allows cross-service correlation.
Q: What is a log retention policy?
A: The policy defining how long logs are kept, balancing compliance requirements, cost, and diagnostic utility.
Intermediate
Q: What is a trace ID (correlation ID)?
A: A unique identifier shared across all log entries for a single request, enabling end-to-end tracing across microservices.
Q: What is log sampling?
A: Writing only a fraction of DEBUG/INFO logs to reduce volume. Used when log volume is too high and INFO logs dominate cost.
Q: What is the difference between logging and monitoring?
A: Logging records discrete events as text; monitoring tracks numeric metrics over time. They are complementary: metrics detect anomalies, logs explain them.
Q: What is log aggregation?
A: Collecting logs from many sources into a single platform (Splunk, ELK, Loki) for unified search and analysis.
Q: Why should secrets never be written to logs?
A: Logs are shared with many teams and stored persistently. Secrets in logs become a security vulnerability; use masking or scrubbing before emission.
Scenario-based
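Q: A service crashed. How do you use logs to find the cause?
A: Search for ERROR or FATAL entries in the window before the crash, sorted by timestamp, to find the root event.
Q: Users report errors, but the logs show fewer errors than expected. What could explain the gap?
A: Log sampling may be dropping errors, the load balancer may be routing some traffic to an unhealthy node that is not logging, or log level filters may be excluding some errors.
Q: Log volume has suddenly spiked. What are the likely causes?
A: Log level accidentally set to DEBUG in production, a new service logging too verbosely, or a slow/looping process emitting errors repeatedly.
Q: How would you design logging for a payment flow that spans several microservices?
A: Structured JSON logs with a shared trace ID injected at the API gateway and propagated through each service, with consistent field names (user_id, amount, status) for cross-service correlation.
Q: How would you design storage for logs that must be retained long-term for compliance?
A: Hot storage for 30 days (fast search), warm for 90 days, then cold/archive (S3 Glacier or Splunk SmartStore) for the remainder, with clear restore procedures.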
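Several of the answers above rely on a trace ID appearing in every log entry for a request. Within a single Python service, that can be sketched with a context variable and a logging filter, as below. The names `trace_id_var`, `TraceIdFilter`, `build_traced_logger`, and `handle_request` are hypothetical, and propagating the ID between services (for example, in a request header) is outside this sketch.

```python
import contextvars
import io
import logging
import uuid

# Trace ID of the request currently being handled; "-" means no request in flight.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True  # never drop records, only enrich them


def build_traced_logger(stream) -> logging.Logger:
    logger = logging.getLogger("demo.trace")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def handle_request(logger: logging.Logger, incoming_trace_id=None) -> None:
    """Reuse the caller's trace ID if one was passed in, otherwise mint a new one."""
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("payment authorized")
        logger.info("receipt sent")
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request


out = io.StringIO()
log = build_traced_logger(out)
handle_request(log, incoming_trace_id="abc-123")
print(out.getvalue(), end="")
```

Because the filter reads the ID from the context rather than from each call site, application code does not need to pass the trace ID to every log statement.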
Summary
Logs are the foundation of observability. Structured logging with consistent fields, centralized collection, and clear retention policies transform raw text into actionable operational intelligence.