Hands-on | Lesson 9 of 11

Real-world Scenarios

Work through production-style scenarios including CPU spikes, memory growth, Kubernetes scraping, and noisy alert patterns.

Simple Explanation (ELI5)

This lesson turns Prometheus from theory into incident response. Instead of just knowing what a metric is, you will use metrics to answer “what is broken?” and “what should we do next?”

Real-world Analogy

Reading a driving manual is different from handling a car skid in the rain. Real scenarios train you to use the dashboard under pressure. Prometheus scenarios do the same for platform incidents.

Technical Explanation

Real incidents require correlation across multiple signals. CPU spikes alone do not prove an outage. You correlate CPU with latency, error rate, queue depth, pod restarts, node pressure, and deployment timing. In Kubernetes, the fastest diagnosis often comes from combining platform and app metrics.
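As a rough illustration of adding a user-facing signal next to CPU, the sketch below computes p95 request latency. It assumes the application exports a Prometheus histogram named http_request_duration_seconds; that metric name is an assumption and may differ in your service.

```promql
# p95 request latency over the last 5 minutes.
# Assumption: the app exports a histogram named http_request_duration_seconds.
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Graphing this on the same time axis as the CPU query makes it easier to see whether a CPU spike actually coincides with user-visible slowness.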

Visual Representation

Symptom

Latency high

Correlate

CPU, memory, errors, restarts

Action

Scale, rollback, tune, or fix code

Commands / Syntax

promql
# CPU spike by pod
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Memory working set by pod
sum by (pod) (container_memory_working_set_bytes{container!=""})

# HTTP 5xx error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Pod restart increases
increase(kube_pod_container_status_restarts_total[15m])

Example (Real-world Use Case)

A payment API in Kubernetes experiences a 4x CPU spike and rising latency after a feature rollout. Prometheus shows CPU growth isolated to the new deployment version, memory is stable, and restart count remains flat. The team rolls back the deployment instead of scaling blindly.
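One possible way to check whether CPU growth is isolated to a version is to join container CPU with pod labels from kube-state-metrics. This is a sketch under assumptions: pods carry a version label, and kube-state-metrics is configured (for example with --metric-labels-allowlist=pods=[version]) so that kube_pod_labels exposes it as label_version.

```promql
# CPU cores used per deployment version.
# Assumption: kube_pod_labels exposes the pod's "version" label as label_version.
sum by (label_version) (
    rate(container_cpu_usage_seconds_total{container!=""}[5m])
  * on (namespace, pod) group_left (label_version)
    kube_pod_labels{label_version!=""}
)
```

If kube_pod_labels returns more than one series per pod (for instance when several kube-state-metrics instances are scraped), the join fails with a many-to-many matching error and the right-hand side needs to be aggregated first.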

Hands-on Section

  1. Simulate a CPU-heavy pod and graph CPU by pod and namespace.
  2. Trigger memory growth in a test service and compare working set growth over time.
  3. Use error-rate PromQL to detect a failing API version.
  4. Check pod restarts and deployment rollout timestamps to correlate cause.
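For step 3, a minimal sketch of an error-rate query split by version might look like the following. It assumes http_requests_total carries a version label set by the application, which the original queries do not confirm.

```promql
# 5xx percentage per API version.
# Assumption: http_requests_total has a "version" label.
100 *
  sum by (version) (rate(http_requests_total{status=~"5.."}[5m]))
/
  sum by (version) (rate(http_requests_total[5m]))
```

Versions that have served no 5xx responses have no series in the numerator, so they are simply absent from the result rather than shown as 0.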

Try It Yourself

Scenario 1: CPU Spike

Symptom: API latency rose from 120 ms to 850 ms.

Investigation: CPU usage by pod shows only checkout-v2 pods are spiking. Error rate remains low, but latency is rising.

Decision: Roll back checkout-v2.

Lesson: CPU spikes are often rollout-specific, not always capacity-wide.
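The investigation step in this scenario could be approached with queries like the ones below. The pod name pattern is an assumption; it presumes checkout-v2 pod names start with "checkout-v2-".

```promql
# Top 5 pods by CPU cores used, to see whether the spike is local to one workload.
topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))

# The same CPU view narrowed to the suspect rollout.
# Assumption: checkout-v2 pod names match "checkout-v2-.*".
sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"checkout-v2-.*", container!=""}[5m]))
```

If the top entries all belong to one rollout while other workloads stay flat, a rollback is usually a better first move than adding capacity.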

Scenario 2: Memory Usage Creeping Up

Symptom: One pod’s memory climbs slowly over six hours until the container is OOMKilled.

Investigation: Gauge trends show monotonic memory growth, restart count increases, and request volume stays flat.

Decision: Investigate an application leak rather than cluster capacity.
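A sketch of queries that could support this investigation is shown below. The OOM query relies on kube-state-metrics being installed, and http_requests_total is reused from the earlier examples as the traffic signal.

```promql
# Growth rate of working set memory in bytes per second, fitted over the last hour.
deriv(container_memory_working_set_bytes{container!=""}[1h])

# Containers whose last termination was an OOM kill (from kube-state-metrics).
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

# Request volume over the same period, to rule out traffic-driven growth.
sum(rate(http_requests_total[5m]))
```

A persistently positive growth rate combined with flat request volume points toward a leak rather than load.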

Scenario 3: Kubernetes Target Missing

Symptom: New service has no dashboards.

Investigation: ServiceMonitor exists, but selector labels do not match the Service labels.

Decision: Fix label mismatch and confirm target discovery.
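To confirm target discovery from the Prometheus side, something like the following could be used. The job name "new-service" is a placeholder, not a name taken from the scenario.

```promql
# Returns 1 only when no "up" series exists for the job, meaning no target was discovered.
# Assumption: the scrape job is named "new-service"; substitute the real job name.
absent(up{job="new-service"})

# Number of discovered targets per job, useful for checking the result after fixing the selector.
count by (job) (up)
```

Once the label mismatch is fixed, the first query should return nothing and the job should appear in the second.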

Scenario 4: Alert Storm During Node Maintenance

Symptom: Multiple node and pod alerts fire during planned maintenance.

Investigation: The expected drain event was not covered by any silence.

Decision: Add maintenance silences and node-drain-aware suppression patterns.
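One way a drain-aware alert expression might be written is to exclude cordoned nodes directly in PromQL. This is a sketch that assumes kube-state-metrics is installed and that cordoning is a reliable marker of planned maintenance in your cluster.

```promql
# Nodes that are NotReady, excluding nodes currently cordoned for maintenance.
(kube_node_status_condition{condition="Ready", status="true"} == 0)
unless on (node)
(kube_node_spec_unschedulable == 1)
```

The trade-off is that a cordoned node that genuinely fails is also suppressed, so this pattern complements Alertmanager silences rather than replacing them.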

Interview Questions

Beginner

What metrics would you check first for a CPU spike?

CPU usage by pod or instance, request rate, latency, and node saturation to see whether the spike is localized or broad.

How do you know if memory growth is dangerous?

If memory steadily increases without stabilizing and leads to restarts or pressure, it may indicate a leak or runaway cache.

What does increasing pod restart count usually suggest?

Crashes, OOM kills, configuration errors, or failing dependencies.

Why is request latency important during incidents?

Latency often reflects user experience directly and helps distinguish harmless resource movement from real impact.

What is a good first PromQL query during an outage?

Querying the up metric is a good first check to see whether critical targets are reachable, followed by request rate and error rate queries.
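As a minimal example of that first check, the query below counts failing targets per scrape job.

```promql
# Targets that failed their most recent scrape, counted per job.
count by (job) (up == 0)
```

An empty result means every discovered target was scraped successfully; it says nothing about targets that were never discovered.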

Intermediate

How do you tell the difference between a scaling issue and a bad deployment?

If only the new version shows bad metrics while cluster-wide capacity is otherwise fine, it points to a deployment issue rather than pure scaling need.

Which metrics help validate a memory leak hypothesis?

Working set memory, restart count, OOM events, GC behavior if available, and request volume to rule out traffic-driven growth.

Why correlate rollout timestamps with metric changes?

Because many incidents are introduced by releases, and correlation helps isolate recent changes quickly.

How can Kubernetes metrics hide app issues?

Pods and nodes can look healthy while the application returns errors or degraded responses. App metrics are still necessary.

What makes a scenario-based dashboard useful?

It groups metrics by incident workflow, such as traffic, errors, saturation, and rollout state, instead of random technical categories.

Scenario-based

Users report slowness, but CPU is only 40%. What next?

I check error rate, request queueing, database latency, thread pool saturation, and downstream dependency metrics. CPU is not the whole story.

Your Kubernetes app has no metrics after deployment. What do you inspect?

I inspect ServiceMonitor or PodMonitor selectors, metrics port naming, namespace selectors, RBAC, and whether the app exposes /metrics.

How would you prove a rollout caused an incident?

I compare metrics by version label, deployment timeline, and pod set. If degradation starts with new replicas only, rollout correlation is strong.

A node drain triggered 20 alerts. What process improvement follows?

Add maintenance silences, environment-aware rules, and suppression for known operational workflows like drains or rolling upgrades.

How do you tell whether scaling out will help a CPU spike?

I check whether the workload is stateless and horizontally scalable, whether one hot shard exists, and whether the bottleneck is actually CPU rather than a dependency.

Summary

Real-world Prometheus work is about correlation, not isolated metrics. CPU spikes, memory growth, and missing Kubernetes targets become manageable when you combine workload, platform, and rollout context.