Beginner · Lesson 1 of 9

Observability Fundamentals

Understand the three pillars of modern observability — metrics, logs, and traces — and why this discipline underpins every production monitoring platform.

Simple Explanation (ELI5)

Imagine your production system is a car. Metrics are the dashboard gauges — speed, temperature, fuel. Logs are the black box recorder — every event and error, timestamped. Traces are the GPS track — a complete route map of each single journey from start to finish. Observability means you can look at all three together and understand exactly why the car broke down.

Observability vs Monitoring

Monitoring asks "is this system up or down?" using predefined thresholds — it tells you that something is wrong. Observability asks "why is this system behaving unexpectedly?" using telemetry data — it tells you what and why. A fully monitored system can still be unobservable if you can't trace the root cause of novel failures.

Dimension         | Monitoring              | Observability
Question answered | Is it up?               | Why is it slow/broken?
Trigger           | Threshold breach        | Anomalous behavior
Pre-requisite     | Known failure modes     | Rich telemetry from system
Tooling           | Nagios, basic alerting  | Dynatrace, Datadog, Prometheus+Grafana
Value             | Alert on known unknowns | Investigate unknown unknowns

The Three Pillars

📊 Metrics

Time-series numeric measurements: CPU %, request rate, error count, response time P95. Aggregatable and efficient to store. Best for dashboards, alerting, capacity planning.

📋 Logs

Timestamped free-text or structured event records from applications and infrastructure. Verbose and detailed — best for debugging the exact sequence of events that led to a failure.

🕸️ Traces

End-to-end record of a single request as it moves through distributed services. Contains spans (unit of work), latency, errors, and service dependencies. Best for latency attribution in microservices.

🔗 Correlation

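The real power: linking a metric anomaly to the relevant logs and then drilling into the trace for that request. Detection still comes from alerting on metrics; correlation shortens the diagnosis that follows, cutting time to root cause (and with it MTTR) from hours to minutes.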
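As a rough illustration of that workflow, the sketch below uses small in-memory lists as stand-ins for a log store and a trace store. The record shapes, the short IDs, and the `correlate` helper are assumptions made for this example, not any vendor's API.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-ins for a log backend and a trace backend.
LOGS = [
    {"ts": datetime(2026, 4, 21, 14, 22, 40), "level": "INFO", "trace_id": "aaa1", "message": "order created"},
    {"ts": datetime(2026, 4, 21, 14, 23, 1), "level": "ERROR", "trace_id": "bbb2", "message": "Payment gateway timeout after 8s"},
    {"ts": datetime(2026, 4, 21, 14, 30, 0), "level": "ERROR", "trace_id": "ccc3", "message": "unrelated failure"},
]
SPANS = [
    {"trace_id": "bbb2", "parent": None, "name": "POST /checkout", "duration_ms": 8423},
    {"trace_id": "bbb2", "parent": "root", "name": "POST payment-gateway/charge", "duration_ms": 8210},
    {"trace_id": "aaa1", "parent": None, "name": "POST /checkout", "duration_ms": 120},
]


def correlate(anomaly_at, window=timedelta(minutes=1)):
    """Return (error logs near the anomaly, slowest child span per affected trace)."""
    start, end = anomaly_at - window, anomaly_at + window
    errors = [log for log in LOGS if log["level"] == "ERROR" and start <= log["ts"] <= end]
    slowest = {}
    for log in errors:
        # Follow the trace ID from the log line into the trace store.
        children = [s for s in SPANS if s["trace_id"] == log["trace_id"] and s["parent"] is not None]
        if children:
            slowest[log["trace_id"]] = max(children, key=lambda s: s["duration_ms"])
    return errors, slowest


# A latency alert fired at 14:23:00; look one minute either side of it.
errors, slowest = correlate(datetime(2026, 4, 21, 14, 23, 0))
for trace_id, span in slowest.items():
    print(trace_id, span["name"], span["duration_ms"])
```

The join key is the trace ID carried in each log record, which is why logs need to include it in the first place.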

Telemetry Pipeline Architecture

App / Infra (Source) → Instrumentation (Agent / SDK) → Collector / Pipeline → Backend (Store & Index) → Query & Visualize

OpenTelemetry

OpenTelemetry (OTel) is a CNCF project that defines a vendor-neutral standard for telemetry instrumentation. It provides APIs and SDKs for most major languages, plus the OTLP wire protocol, so metrics, logs, and traces can be instrumented once and exported to any compatible backend (Dynatrace, Datadog, Jaeger, etc.).

python — OpenTelemetry manual trace span
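from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Initialize the tracer provider; service.name identifies this service in the backend
provider = TracerProvider(resource=Resource.create({"service.name": "my-service"}))
exporter = OTLPSpanExporter(
    endpoint="https://your-env.live.dynatrace.com/api/v2/otlp/v1/traces",
    headers={"Authorization": "Api-Token YOUR_TOKEN"},
)
# BatchSpanProcessor buffers finished spans and exports them in the background
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-service")


def process_payment():
    """Placeholder for your business logic."""
    return "ok"


# Manual instrumentation: the span ends when the with-block exits
with tracer.start_as_current_span("checkout.process") as span:
    span.set_attribute("user.id", "user-123")
    span.set_attribute("order.total", 149.99)
    result = process_payment()

# Flush buffered spans before the process exits
provider.shutdown()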

The Observability Maturity Model

Level   | Capability                                  | Tooling
Level 0 | No telemetry: "is it running?"              | Manual checks, ping
Level 1 | Basic metrics and uptime alerts             | Nagios, CloudWatch alarms
Level 2 | Metrics + dashboards + log aggregation      | Prometheus, ELK, Splunk
Level 3 | Distributed tracing, service maps           | Jaeger, Zipkin, Dynatrace APM
Level 4 | AI-driven root cause, automatic baselining  | Dynatrace Davis, Datadog Watchdog

Signals in Practice

yaml — Metric, log, and trace example (same request)
# METRIC (Prometheus exposition format, simplified to a single sample;
# a real histogram exposes _bucket, _sum, and _count series)
http_request_duration_seconds{service="checkout", method="POST", status="500"} 8.423

# LOG (structured JSON), written when the call timed out
{
  "timestamp": "2026-04-21T14:23:09Z",
  "level": "ERROR",
  "service": "checkout",
  "trace_id": "3f9a1b2c4d5e6f708192a3b4c5d6e7f8",
  "message": "Payment gateway timeout after 8s",
  "user_id": "user-123",
  "order_id": "ord-789"
}

# TRACE SPAN (OpenTelemetry-style, simplified)
# Trace IDs are 32 hex characters and span IDs are 16; a root span has no parent.
{
  "traceId": "3f9a1b2c4d5e6f708192a3b4c5d6e7f8",
  "spanId": "9a1b2c3d4e5f6071",
  "parentSpanId": "",
  "name": "POST /checkout",
  "startTime": "2026-04-21T14:23:01Z",
  "durationMs": 8423,
  "status": "ERROR",
  "attributes": {
    "http.status_code": 500,
    "service.name": "checkout"
  }
}

Debugging Scenarios

Real-world Use Case

An e-commerce platform's checkout conversion rate dropped by 3% on Black Friday. Metrics showed elevated P99 latency on the checkout service. Logs showed intermittent "payment gateway timeout" errors. Distributed traces revealed the exact span: a POST call to a third-party payment API was taking 8+ seconds every ~50 requests. Root cause: the payment gateway's rate limiter was throttling traffic without returning HTTP 429 — only a trace on the specific slow request revealed the behavior. The team added a client-side circuit breaker and resolved the issue in 20 minutes.

Interview Questions

Beginner

What are the three pillars of observability?

Metrics (numeric time-series), Logs (event records), and Traces (end-to-end request paths). Together they let you detect a problem, see what happened, and locate where in the request path it happened.

What is the difference between monitoring and observability?

Monitoring tells you that something is wrong using predefined thresholds. Observability tells you why using rich telemetry data — including for failure modes you didn't anticipate.

What is a trace span?

A span is a named unit of work within a distributed trace — representing a single operation (HTTP call, DB query, cache lookup). Spans are linked by parent-child relationships to form a trace tree.

What is OpenTelemetry?

A CNCF project that provides vendor-neutral APIs, per-language SDKs, and a wire protocol (OTLP) for telemetry instrumentation, so metrics, logs, and traces can be collected once and exported to any compatible backend.

Why are structured logs better than unstructured?

Structured logs (JSON/key-value) have named fields that can be automatically indexed and filtered, making them far faster to search and correlate with metrics and traces.
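A minimal sketch of what this looks like with Python's standard `logging` module. The `JsonFormatter` class, the field names, and the hard-coded service name are assumptions for illustration; timestamps are left out to keep the output deterministic.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "service": "checkout",  # assumed service name
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("trace_id", "user_id", "order_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("order created", extra={"trace_id": "aaa1", "order_id": "ord-788"})
logger.error("Payment gateway timeout after 8s", extra={"trace_id": "bbb2", "order_id": "ord-789"})

# Filtering becomes a field lookup rather than a regex over free text.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
errors = [r for r in records if r["level"] == "ERROR"]
print(errors)
```

Because every record carries a `trace_id` field, the same lookup can jump from an error line straight to the matching trace.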

Intermediate

What is cardinality in the context of metrics?

Cardinality is the number of unique label combinations for a metric. High cardinality (e.g., per-user metrics) causes storage explosion and query performance issues in time-series databases.
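The arithmetic behind that explosion is a simple product. The sketch below computes the worst-case series count for one metric; the label names and value counts are made-up numbers for illustration, and real systems usually see fewer series because not every combination occurs.

```python
from math import prod


def series_count(label_values):
    """Worst-case number of time series: product of distinct values per label."""
    return prod(label_values.values())


# Distinct values per label (illustrative numbers).
bounded = {"service": 20, "method": 5, "status": 6}
with_user = {**bounded, "user_id": 100_000}

print(series_count(bounded))    # 600
print(series_count(with_user))  # 60000000
```

Adding a single unbounded label multiplies every existing series, which is why per-user or per-request identifiers belong in logs and traces rather than metric labels.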

What is trace context propagation?

Passing the trace ID and the caller's span ID in request headers (the traceparent header defined by W3C Trace Context) so every downstream service can link its spans to the originating trace, enabling end-to-end correlation.
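To make the header concrete, here is a small sketch that builds and parses a version-00 `traceparent` value (`version-traceid-parentid-flags`). The helper names are hypothetical, and the parser only handles the version-00 shape; in practice an OpenTelemetry SDK propagator does this for you.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def build_traceparent(trace_id, span_id, sampled=True):
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the parsed context, or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "sampled": bool(int(flags, 16) & 1)}


# Service A: start a trace and attach the header to the outgoing request.
trace_id = secrets.token_hex(16)
outgoing_headers = {"traceparent": build_traceparent(trace_id, new_span_id())}

# Service B: read the header and continue the same trace with its own span ID.
ctx = parse_traceparent(outgoing_headers["traceparent"])
child_header = build_traceparent(ctx["trace_id"], new_span_id(), ctx["sampled"])
print(child_header)
```

The trace ID stays constant across every hop, while each service substitutes its own span ID before calling the next one.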

When would you use traces over logs for debugging?

Use traces when you need to understand latency attribution across multiple services — which service or call is responsible for the slow P99. Logs tell you what happened; traces tell you how long each step took and where time was spent.
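One way to picture latency attribution is to compute each span's self time: its own duration minus the time spent in its direct children. The sketch below does that for a hand-written trace; the span records are invented, and it assumes children run sequentially (overlapping children would need interval arithmetic instead).

```python
from collections import defaultdict

# Illustrative spans from a single trace; "parent" refers to another span's id.
spans = [
    {"id": "a", "parent": None, "name": "POST /checkout", "duration_ms": 8423},
    {"id": "b", "parent": "a", "name": "inventory.reserve", "duration_ms": 95},
    {"id": "c", "parent": "a", "name": "payment.charge", "duration_ms": 8250},
    {"id": "d", "parent": "c", "name": "POST gateway /charge", "duration_ms": 8210},
]


def self_times(spans):
    """Map span name -> duration minus the total duration of its direct children."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]
    return {span["name"]: span["duration_ms"] - child_total[span["id"]] for span in spans}


times = self_times(spans)
culprit = max(times, key=times.get)
for name, ms in sorted(times.items(), key=lambda item: -item[1]):
    print(f"{ms:>6} ms  {name}")
print("most time spent in:", culprit)
```

The root span is the slowest by total duration, but almost all of that time belongs to the outbound gateway call, which is the kind of answer a log line alone does not give.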

What is the RED method for service metrics?

Rate (requests per second), Errors (error rate), Duration (latency distribution). A minimal but sufficient set of metrics for any request-based microservice.
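The three RED numbers can be derived from nothing more than a status code and a duration per request. The sketch below uses a hand-written sample and a nearest-rank percentile; the data, the 10-second window, and the choice to count only 5xx responses as errors are assumptions for illustration.

```python
import math

# (status_code, duration_seconds) for each request seen in a 10-second window.
requests = [
    (200, 0.12), (200, 0.10), (500, 8.42), (200, 0.15), (200, 0.11),
    (200, 0.09), (200, 0.13), (503, 2.00), (200, 0.14), (200, 0.10),
]
WINDOW_SECONDS = 10


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted list."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


rate = len(requests) / WINDOW_SECONDS  # Rate: requests per second
error_ratio = sum(1 for status, _ in requests if status >= 500) / len(requests)  # Errors
durations = sorted(duration for _, duration in requests)  # Duration distribution
p50 = percentile(durations, 50)
p95 = percentile(durations, 95)

print(f"rate={rate}/s errors={error_ratio:.0%} p50={p50}s p95={p95}s")
```

In a production setup these values would normally come from counters and histograms queried in the metrics backend rather than from raw request lists.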

What is sampling in distributed tracing?

Only capturing a fraction of traces (e.g., 1%) to manage storage and performance overhead. Head-based sampling decides at trace start; tail-based sampling decides after seeing the full trace (better for capturing errors/slow requests).
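The difference between the two strategies is what information the decision can use. The following sketch contrasts them; the function names, the 1-second slow threshold, and the span record shape are assumptions, and production samplers (for example in a collector) are more elaborate.

```python
def head_sample(trace_id, ratio):
    """Decide at trace start using only the trace ID (32-char hex string).

    Deriving the decision from the ID means every service that sees the
    same trace ID reaches the same verdict.
    """
    low_bits = int(trace_id[-16:], 16)  # low 64 bits of the trace ID
    return low_bits < int(ratio * 2**64)


def tail_sample(spans, slow_ms=1000):
    """Decide after the trace is complete: keep errors and slow requests."""
    has_error = any(span["status"] == "ERROR" for span in spans)
    longest_ms = max(span["duration_ms"] for span in spans)
    return has_error or longest_ms >= slow_ms


fast_ok = [{"status": "OK", "duration_ms": 120}]
slow_ok = [{"status": "OK", "duration_ms": 8423}]
fast_error = [{"status": "ERROR", "duration_ms": 40}]

print(head_sample("0" * 32, 0.01), head_sample("f" * 32, 0.01))
print(tail_sample(fast_ok), tail_sample(slow_ok), tail_sample(fast_error))
```

Head-based sampling is cheap because unsampled traces are never recorded, but it cannot know a request will turn out slow; tail-based sampling can, at the cost of buffering complete traces before deciding.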

Scenario-based

Users report intermittent slowness but your dashboards look normal. How do you investigate?

Dashboards that plot averages hide tail latency, so check the P99/P99.9 latency percentiles first. Pull distributed traces for slow requests to find which service/span introduces the delay, then correlate with logs for that trace ID to find the root cause.
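A quick numeric check shows how an average can look healthy while a slice of users suffers. The latency sample below is synthetic (980 fast requests, 20 slow ones), chosen only to illustrate the effect.

```python
from statistics import mean, quantiles

# Synthetic sample: 98% of requests take 100 ms, 2% take 5 seconds.
latencies_ms = [100] * 980 + [5000] * 20

cuts = quantiles(latencies_ms, n=100)  # 99 cut points: cuts[0] is P1, cuts[98] is P99
avg = mean(latencies_ms)
p50 = cuts[49]
p99 = cuts[98]

print(f"mean={avg} ms  p50={p50} ms  p99={p99} ms")
```

The mean lands under 200 ms and the median at 100 ms, yet one request in fifty takes 5 seconds, and only the high percentile exposes that.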

Your team wants to implement observability for a new microservices platform. Where do you start?

Start with metrics (RED method per service), add structured logging with trace context injection, then add distributed tracing. Use OpenTelemetry SDK to instrument once and avoid vendor lock-in.

How do you correlate a metric anomaly with its root cause?

Use trace IDs embedded in logs, and attached to metrics as exemplars rather than as labels (a trace ID label would explode cardinality), to link the anomaly time window to specific distributed traces. Drill into the trace waterfall to find the slow or failing span, then inspect logs for that service at that timestamp.

What is the difference between black-box and white-box monitoring?

Black-box: testing externally visible behavior (HTTP probes, synthetic tests) — good for detecting user-facing impact. White-box: instrumentation inside the system (APM, traces, logs) — good for root cause analysis. Production systems need both.

A third-party API your service calls has no tracing. How do you handle this?

Create a wrapper span around each call to the external API in your service. Record duration, status code, and error states as span attributes. This makes the external dependency visible in your traces even though you can't instrument their side.
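The wrapper pattern can be sketched with the standard library alone. Here a plain dict stands in for a span and a list stands in for the exporter; `external_call_span`, `charge_card`, and the attribute names are hypothetical. With the OpenTelemetry SDK the same shape would use `tracer.start_as_current_span(...)` and `span.set_attribute(...)`.

```python
import time
from contextlib import contextmanager

RECORDED_SPANS = []  # stand-in for a span exporter


@contextmanager
def external_call_span(name, **attributes):
    """Record duration, status, and error type around a call we cannot instrument."""
    span = {"name": name, "attributes": dict(attributes), "status": "OK"}
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span["status"] = "ERROR"
        span["attributes"]["error.type"] = type(exc).__name__
        raise  # never swallow the caller's exception
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        RECORDED_SPANS.append(span)


def charge_card(amount):
    """Hypothetical third-party call; replace with the real HTTP client call."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    return {"status_code": 200}


# Successful call: the status code is recorded as a span attribute.
with external_call_span("payment-gateway.charge", peer="payments.example.com") as span:
    response = charge_card(149.99)
    span["attributes"]["http.status_code"] = response["status_code"]

# Failing call: the span is still recorded, marked as an error.
try:
    with external_call_span("payment-gateway.charge", peer="payments.example.com"):
        charge_card(0)
except ValueError:
    pass

for recorded in RECORDED_SPANS:
    print(recorded["name"], recorded["status"], recorded["attributes"])
```

Recording the span in `finally` means timeouts and exceptions show up in the trace as clearly as successful calls do.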

Summary

Observability is the foundation of reliable production systems. Metrics provide the signal, logs provide the context, and traces provide the path. Together they enable engineers to answer "why is this service behaving unexpectedly?" rather than just "is it up?" — the critical shift from reactive monitoring to proactive engineering.