Observability Fundamentals
Understand the three pillars of modern observability — metrics, logs, and traces — and why this discipline underpins every production monitoring platform.
Simple Explanation (ELI5)
Imagine your production system is a car. Metrics are the dashboard gauges — speed, temperature, fuel. Logs are the black box recorder — every event and error, timestamped. Traces are the GPS track — a complete route map of each single journey from start to finish. Observability means you can look at all three together and understand exactly why the car broke down.
Observability vs Monitoring
Monitoring asks "is this system up or down?" using predefined thresholds — it tells you that something is wrong. Observability asks "why is this system behaving unexpectedly?" using telemetry data — it tells you what and why. A fully monitored system can still be unobservable if you can't trace the root cause of novel failures.
| Dimension | Monitoring | Observability |
|---|---|---|
| Question answered | Is it up? | Why is it slow/broken? |
| Trigger | Threshold breach | Anomalous behavior |
| Pre-requisite | Known failure modes | Rich telemetry from system |
| Tooling | Nagios, basic alerting | Dynatrace, Datadog, Prometheus+Grafana |
| Value | Alert on known unknowns | Investigate unknown unknowns |
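The "Trigger" row can be made concrete with a small comparison. The following is a minimal sketch, assuming latency samples in milliseconds; the function names, the 500 ms limit, and the three-sigma rule are illustrative assumptions, not the behavior of any specific tool.

```python
from statistics import mean, stdev

def threshold_alert(value_ms: float, limit_ms: float = 500.0) -> bool:
    """Monitoring-style check: fire only when a fixed, predefined limit is crossed."""
    return value_ms > limit_ms

def baseline_alert(history_ms: list[float], value_ms: float, sigmas: float = 3.0) -> bool:
    """Anomaly-style check: fire when the value deviates from recent behavior."""
    if len(history_ms) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history_ms)
    sd = stdev(history_ms)
    if sd == 0:
        return value_ms != mu
    return abs(value_ms - mu) > sigmas * sd

# Hypothetical recent latencies for one endpoint, then a new sample.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
sample = 180.0

static_fired = threshold_alert(sample)            # 180 ms is below the fixed limit
baseline_fired = baseline_alert(history, sample)  # but far outside normal behavior
```

Here the fixed threshold stays silent while the baseline check flags the sample, which is the kind of unanticipated deviation the table associates with observability.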
The Three Pillars
Metrics
Time-series numeric measurements: CPU %, request rate, error count, response time P95. Aggregatable and efficient to store. Best for dashboards, alerting, capacity planning.
Logs
Timestamped free-text or structured event records from applications and infrastructure. Verbose and detailed; best for debugging the exact sequence of events that led to a failure.
Traces
End-to-end record of a single request as it moves through distributed services. Contains spans (units of work), latency, errors, and service dependencies. Best for latency attribution in microservices.
The real power lies in linking a metric anomaly to the relevant logs and then drilling into the trace for that request, which can cut MTTD (mean time to detect) from hours to minutes.
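The metric-to-log-to-trace workflow described above can be sketched as follows. This is a minimal illustration under stated assumptions: `LOGS` and `SPANS` are hypothetical in-memory records standing in for a real telemetry backend, and the helper names and child span names are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory telemetry; a real backend would be queried instead.
LOGS = [
    {"ts": datetime(2026, 4, 21, 14, 22, 10), "level": "INFO",
     "trace_id": "aaa1", "message": "order created"},
    {"ts": datetime(2026, 4, 21, 14, 23, 1), "level": "ERROR",
     "trace_id": "3f9a1b2c4d5e6f70", "message": "Payment gateway timeout after 8s"},
]
SPANS = [
    {"trace_id": "3f9a1b2c4d5e6f70", "name": "POST /checkout",
     "duration_ms": 8423, "parent": None},
    {"trace_id": "3f9a1b2c4d5e6f70", "name": "POST payment-gateway",
     "duration_ms": 8100, "parent": "POST /checkout"},
    {"trace_id": "3f9a1b2c4d5e6f70", "name": "SELECT orders",
     "duration_ms": 40, "parent": "POST /checkout"},
]

def error_trace_ids(anomaly_at, window=timedelta(minutes=1)):
    """Metric -> logs: collect trace IDs of error logs near the anomaly time."""
    return {
        log["trace_id"] for log in LOGS
        if log["level"] == "ERROR" and abs(log["ts"] - anomaly_at) <= window
    }

def slowest_child_span(trace_id):
    """Logs -> trace: within one trace, find the child span that took longest."""
    children = [s for s in SPANS
                if s["trace_id"] == trace_id and s["parent"] is not None]
    return max(children, key=lambda s: s["duration_ms"], default=None)

# The timestamp at which a latency metric was observed to spike (assumed).
anomaly_time = datetime(2026, 4, 21, 14, 23, 0)
suspects = {tid: slowest_child_span(tid) for tid in error_trace_ids(anomaly_time)}
```

The shared `trace_id` field is what makes the hop from logs to traces possible; without it, the three signals can only be correlated by timestamp.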
Telemetry Pipeline Architecture
Source → Agent / SDK → Pipeline → Store & Index → Visualize
OpenTelemetry
OpenTelemetry (OTel) is a CNCF project that defines a vendor-neutral standard for telemetry instrumentation. It provides APIs and SDKs for most major languages and frameworks to collect metrics, logs, and traces, and it exports them over OTLP to any compatible backend (Dynatrace, Datadog, Jaeger, etc.).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Initialize the tracer provider and export spans over OTLP/HTTP
provider = TracerProvider()
exporter = OTLPSpanExporter(
    endpoint="https://your-env.live.dynatrace.com/api/v2/otlp/v1/traces",
    headers={"Authorization": "Api-Token YOUR_TOKEN"},
)
# Batch spans in the background instead of exporting each one synchronously
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-service")

# Manual instrumentation
with tracer.start_as_current_span("checkout.process") as span:
    span.set_attribute("user.id", "user-123")
    span.set_attribute("order.total", 149.99)
    # your business logic here; process_payment() is assumed to be defined elsewhere
    result = process_payment()

The Observability Maturity Model
| Level | Capability | Tooling |
|---|---|---|
| Level 0 | No telemetry — "is it running?" | Manual checks, ping |
| Level 1 | Basic metrics and uptime alerts | Nagios, CloudWatch alarms |
| Level 2 | Metrics + dashboards + log aggregation | Prometheus, ELK, Splunk |
| Level 3 | Distributed tracing, service maps | Jaeger, Zipkin, Dynatrace APM |
| Level 4 | AI-driven root cause, automatic baselining | Dynatrace Davis, Datadog Watchdog |
Signals in Practice
# METRIC (Prometheus format)
http_request_duration_seconds{service="checkout", method="POST", status="500"} 8.423
# LOG (structured JSON)
{
  "timestamp": "2026-04-21T14:23:01Z",
  "level": "ERROR",
  "service": "checkout",
  "trace_id": "3f9a1b2c4d5e6f70",
  "message": "Payment gateway timeout after 8s",
  "user_id": "user-123",
  "order_id": "ord-789"
}
# TRACE SPAN (OpenTelemetry)
{
  "traceId": "3f9a1b2c4d5e6f70",
  "spanId": "9a1b2c3d",
  "parentSpanId": "0000000000000000",
  "name": "POST /checkout",
  "startTime": "2026-04-21T14:23:01Z",
  "duration": 8423,
  "status": "ERROR",
  "attributes": {
    "http.status_code": 500,
    "service.name": "checkout-service"
  }
}
Debugging Scenarios
- Metrics show high latency but logs show no errors: The issue is likely a slow downstream dependency — use traces to identify which span is taking the longest.
- Logs show errors but metrics look fine: The errors may affect only a small percentage of traffic. Add an error-rate metric or dashboard panel to surface the signal.
- Trace IDs don't appear in logs: Instrumentation doesn't inject trace context — add trace context propagation to the logging configuration.
- Telemetry gaps every 5 minutes: Likely a scrape-interval or batch-flush problem in the collection path. Check the collector configuration and network connectivity.
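The fix for missing trace IDs in logs can be sketched with the standard library alone. This is a minimal illustration, not the OpenTelemetry logging integration: `current_trace_id`, `TraceContextFilter`, and `JsonFormatter` are hypothetical names, and in a real service the trace ID would come from the active span context rather than a hand-set variable.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace ID; with OpenTelemetry this would
# be read from the current span context instead of being set by hand.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

class TraceContextFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it

class JsonFormatter(logging.Formatter):
    """Render records as structured JSON that includes the trace ID."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })

stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example output confined to our handler

current_trace_id.set("3f9a1b2c4d5e6f70")
logger.error("Payment gateway timeout after 8s")
emitted = json.loads(stream.getvalue())
```

Attaching the filter to the handler means every record passing through it is enriched, so application code does not need to pass the trace ID to each logging call.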
Real-world Use Case
An e-commerce platform's checkout conversion rate dropped by 3% on Black Friday. Metrics showed elevated P99 latency on the checkout service. Logs showed intermittent "payment gateway timeout" errors. Distributed traces revealed the exact span: a POST call to a third-party payment API was taking 8+ seconds roughly once every 50 requests. Root cause: the payment gateway's rate limiter was throttling traffic without returning HTTP 429, so only a trace of a specific slow request revealed the behavior. The team added a client-side circuit breaker and resolved the issue in 20 minutes.
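A client-side circuit breaker of the kind mentioned above might look like the following. This is a minimal sketch, not the team's actual implementation: the class name, thresholds, and `flaky_gateway` stand-in are assumptions, and it presumes the wrapped call raises an exception on timeout (for example via a client-side request timeout).

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result

def flaky_gateway():
    """Hypothetical stand-in for the throttled payment API call."""
    raise TimeoutError("payment gateway timeout")

breaker = CircuitBreaker(failure_threshold=2, reset_after_s=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_gateway)
    except CircuitOpenError:
        outcomes.append("rejected")
    except TimeoutError:
        outcomes.append("timeout")
```

After two consecutive timeouts the third call is rejected immediately, so callers stop waiting 8+ seconds on a dependency that is known to be unhealthy.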
Interview Questions
Beginner
Q: What are the three pillars of observability?
A: Metrics (numeric time series), logs (event records), and traces (end-to-end request paths). Together they provide a complete picture of system behavior.
Q: What is the difference between monitoring and observability?
A: Monitoring tells you that something is wrong using predefined thresholds. Observability tells you why using rich telemetry data, including for failure modes you didn't anticipate.
Q: What is a span?
A: A span is a named unit of work within a distributed trace, representing a single operation (HTTP call, DB query, cache lookup). Spans are linked by parent-child relationships to form a trace tree.
Q: What is OpenTelemetry?
A: A CNCF project that provides vendor-neutral APIs, SDKs, and a wire protocol (OTLP) for telemetry instrumentation, so that metrics, logs, and traces can be instrumented once and exported to any compatible backend.
Q: Why use structured logs?
A: Structured logs (JSON/key-value) have named fields that can be automatically indexed and filtered, making them far faster to search and correlate with metrics and traces than free-text logs.
Intermediate
Q: What is cardinality, and why does it matter?
A: Cardinality is the number of unique label combinations for a metric. High cardinality (e.g., per-user metrics) causes storage explosion and query performance issues in time-series databases.
Q: What is trace context propagation?
A: Passing the trace ID and span ID in request headers (W3C Trace Context format) so every downstream service can link its spans to the originating trace, enabling end-to-end correlation.
Q: When would you use traces rather than logs?
A: Use traces when you need to understand latency attribution across multiple services, that is, which service or call is responsible for the slow P99. Logs tell you what happened; traces tell you how long each step took and where time was spent.
Q: What is the RED method?
A: Rate (requests per second), Errors (error rate), Duration (latency distribution). A minimal but sufficient set of metrics for any request-based microservice.
Q: What is trace sampling?
A: Capturing only a fraction of traces (e.g., 1%) to manage storage and performance overhead. Head-based sampling decides at trace start; tail-based sampling decides after seeing the full trace, which is better for capturing errors and slow requests.
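The difference between head-based and tail-based sampling described above can be sketched as two decision functions. This is an illustrative sketch under stated assumptions: the function names, the CRC32-based bucketing, and the 1000 ms slow threshold are hypothetical choices, not the algorithm of any particular collector.

```python
import zlib

def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Head-based: decide from the trace ID alone, before any span exists."""
    # A deterministic hash keeps the decision consistent across services.
    bucket = zlib.crc32(trace_id.encode()) % 10_000
    return bucket < rate * 10_000

def tail_sample(spans: list[dict], slow_ms: float = 1000.0) -> bool:
    """Tail-based: decide after the trace completes, keeping errors and slow requests."""
    if not spans:
        return False
    has_error = any(s["status"] == "ERROR" for s in spans)
    # Assumes the longest span is the root and covers the whole request.
    total_ms = max(s["duration_ms"] for s in spans)
    return has_error or total_ms >= slow_ms

# Hypothetical completed traces.
slow_trace = [
    {"name": "POST /checkout", "duration_ms": 8423, "status": "ERROR"},
    {"name": "POST payment-gateway", "duration_ms": 8100, "status": "ERROR"},
]
fast_trace = [{"name": "GET /health", "duration_ms": 3, "status": "OK"}]

keep_slow = tail_sample(slow_trace)
keep_fast = tail_sample(fast_trace)
```

The trade-off is visible in the signatures: `head_sample` needs only the trace ID and is cheap, while `tail_sample` needs every span of the trace buffered before it can decide.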
Scenario-based
Q: Users report intermittent slowness, but dashboards look normal. How do you investigate?
A: Normal dashboards often show averages, so check the P99/P99.9 latency percentiles. Pull distributed traces for slow requests to find which service or span introduces the delay. Correlate with logs for that trace ID to find the root cause.
Q: How would you add observability to a system that has none?
A: Start with metrics (RED method per service), add structured logging with trace context injection, then add distributed tracing. Use the OpenTelemetry SDK to instrument once and avoid vendor lock-in.
Q: How do you correlate a metric anomaly with its root cause?
A: Use trace IDs embedded in logs and attached to metrics as exemplars to link the anomaly time window to specific distributed traces. Drill into the trace waterfall to find the slow or failing span, then inspect logs for that service at that timestamp.
Q: What is the difference between black-box and white-box monitoring?
A: Black-box: testing externally visible behavior (HTTP probes, synthetic tests), which is good for detecting user-facing impact. White-box: instrumentation inside the system (APM, traces, logs), which is good for root cause analysis. Production systems need both.
Q: How do you get visibility into a third-party API you cannot instrument?
A: Create a wrapper span around each call to the external API in your service. Record duration, status code, and error states as span attributes. This makes the external dependency visible in your traces even though you can't instrument their side.
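The wrapper-span approach for an external API can be sketched without any tracing library. This is a minimal illustration under stated assumptions: `RECORDED_SPANS` is a hypothetical in-memory sink, `external_call_span` and `charge_card` are invented names, and in a real service the span would be created with a tracer such as the OpenTelemetry one shown earlier.

```python
import time
from contextlib import contextmanager

RECORDED_SPANS = []  # hypothetical in-memory sink; a real tracer would export these

@contextmanager
def external_call_span(name: str, **attributes):
    """Time a call to a dependency we cannot instrument and record the outcome."""
    span = {"name": name, "attributes": dict(attributes), "status": "OK"}
    start = time.perf_counter()
    try:
        yield span  # the caller can attach attributes such as the status code
    except Exception as exc:
        span["status"] = "ERROR"
        span["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        # Duration is recorded whether the call succeeded or failed.
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        RECORDED_SPANS.append(span)

def charge_card():
    """Stand-in for the third-party API client (hypothetical); returns an HTTP status."""
    return 200

with external_call_span("POST payment-gateway", **{"peer.service": "payment-gateway"}) as span:
    span["attributes"]["http.status_code"] = charge_card()
```

Because the span is recorded in the `finally` block, slow or failing calls to the dependency still show up with their duration and error type.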
Summary
Observability is the foundation of reliable production systems. Metrics provide the signal, logs provide the context, and traces provide the path. Together they enable engineers to answer "why is this service behaving unexpectedly?" rather than just "is it up?" — the critical shift from reactive monitoring to proactive engineering.