Real-world Scenarios
Work through production-style scenarios including CPU spikes, memory growth, Kubernetes scraping, and noisy alert patterns.
Simple Explanation (ELI5)
This lesson turns Prometheus from theory into incident response. Instead of just knowing what a metric is, you will use metrics to answer “what is broken?” and “what should we do next?”
Real-world Analogy
Reading a driving manual is different from handling a car skid in the rain. Real scenarios train you to use the dashboard under pressure. Prometheus scenarios do the same for platform incidents.
Technical Explanation
Real incidents require correlation across multiple signals. CPU spikes alone do not prove an outage. You correlate CPU with latency, error rate, queue depth, pod restarts, node pressure, and deployment timing. In Kubernetes, the fastest diagnosis often comes from combining platform and app metrics.
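As a rough illustration of that correlation, the sketch below graphs a latency signal next to the CPU signal. It assumes the application exports a histogram named http_request_duration_seconds, which is a common convention but not something this lesson confirms, so substitute whatever latency metric your service exposes.

```promql
# Assumption: the app exports a histogram called http_request_duration_seconds.
# p95 request latency across all instances over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# CPU cores consumed per pod over the same window, for side-by-side graphing
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```

If latency rises while CPU stays flat, the bottleneck is more likely a dependency or a queue than compute.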
Visual Representation
Latency high -> CPU, memory, errors, restarts -> Scale, rollback, tune, or fix code
Commands / Syntax
# CPU spike by pod
sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
# Memory working set by pod
sum by (pod) (container_memory_working_set_bytes{container!=""})
# HTTP 5xx error percentage
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
# Pod restart increases
increase(kube_pod_container_status_restarts_total[15m])
Example (Real-world Use Case)
A payment API in Kubernetes experiences a 4x CPU spike and rising latency after a feature rollout. Prometheus shows CPU growth isolated to the new deployment version, memory is stable, and restart count remains flat. The team rolls back the deployment instead of scaling blindly.
Hands-on Section
- Simulate a CPU-heavy pod and graph CPU by pod and namespace.
- Trigger memory growth in a test service and compare working set growth over time.
- Use error-rate PromQL to detect a failing API version.
- Check pod restarts and deployment rollout timestamps to correlate cause.
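For the last exercise, one possible way to line up rollout timing with restarts is sketched here. It assumes kube-state-metrics is being scraped, and "checkout" is a hypothetical Deployment name used only for illustration.

```promql
# Assumption: kube-state-metrics is scraped; "checkout" is a hypothetical Deployment name.
# Non-zero when the Deployment spec changed in the last 30 minutes
changes(kube_deployment_status_observed_generation{deployment="checkout"}[30m])

# Containers of that workload that restarted in the same window
increase(kube_pod_container_status_restarts_total{pod=~"checkout-.*"}[30m]) > 0
```

The generation counter moves on any spec change, including a manual scale, so treat a non-zero result as a hint to check the rollout history rather than as proof of a new release.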
Try It Yourself
- Write one query that helps prove whether a CPU spike is localized to one workload.
- Describe when memory growth is a leak versus expected cache growth.
- List the first three graphs you would open for a slow API incident.
Scenario 1: CPU Spike
Symptom: API latency rose from 120 ms to 850 ms.
Investigation: CPU usage by pod shows that only the checkout-v2 pods are spiking. The error rate stays low while latency climbs.
Decision: Roll back checkout-v2.
Lesson: A CPU spike is often specific to one rollout rather than a sign of a cluster-wide capacity shortage.
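A minimal sketch of the investigation step, assuming pod names begin with the workload name (for example checkout-v2-<hash>); adjust the regular expression to your own naming scheme.

```promql
# Assumption: pod names start with the workload name, e.g. checkout-v2-<hash>.
# CPU cores used by each checkout-v2 pod
sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~"checkout-v2-.*", container!=""}[5m]))

# Total CPU for every other pod, as a baseline
sum(rate(container_cpu_usage_seconds_total{pod!~"checkout-v2-.*", container!=""}[5m]))
```

If the first query jumps at rollout time while the baseline stays flat, the spike is localized to the new version.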
Scenario 2: Memory Usage Creeping Up
Symptom: One pod’s memory climbs slowly over six hours until the container is OOMKilled.
Investigation: The memory gauge grows monotonically, the restart count increases, and request volume stays flat.
Decision: Investigate an application leak rather than cluster capacity.
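The queries below are one hedged way to express that investigation. They assume kube-state-metrics v2 is scraped alongside cAdvisor, and that both sources carry matching namespace, pod, and container labels with one series per container.

```promql
# Average growth of the working set in bytes per second over the last hour
deriv(container_memory_working_set_bytes{container!=""}[1h])

# Containers projected to exceed their memory limit within 4 hours.
# Assumption: kube-state-metrics v2 is scraped and both sides carry
# namespace, pod, and container labels.
predict_linear(container_memory_working_set_bytes{container!=""}[1h], 4 * 3600)
  > on (namespace, pod, container)
kube_pod_container_resource_limits{resource="memory"}

# Containers whose last termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```

A positive slope that never flattens while traffic is steady points toward a leak. A slope that levels off is more consistent with a cache filling up.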
Scenario 3: Kubernetes Target Missing
Symptom: A new service shows no data on its dashboards.
Investigation: A ServiceMonitor exists, but its selector labels do not match the labels on the Service.
Decision: Fix the label mismatch and confirm that the target is discovered.
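To confirm discovery from the Prometheus side, a sketch like the following can help. The job name "new-service" is hypothetical; the real value depends on how your relabeling sets the job label.

```promql
# Assumption: "new-service" is a hypothetical job name; the real value depends on your relabeling.
# Returns 1 when no target with this job label is being scraped at all
absent(up{job="new-service"})

# Number of discovered targets per job
count by (job) (up)
```

A selector mismatch usually means the up series is missing entirely, whereas a discovered but failing target shows up with a value of 0.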
Scenario 4: Alert Storm During Node Maintenance
Symptom: Multiple node and pod alerts fire during planned maintenance.
Investigation: The planned drain was not covered by any silence.
Decision: Add maintenance silences and node-drain-aware suppression patterns.
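One possible drain-aware pattern is to exclude cordoned nodes in the alert expression itself. The sketch assumes both metrics come from kube-state-metrics and share a node label.

```promql
# Nodes reporting NotReady, excluding nodes that are cordoned.
# Assumption: both metrics come from kube-state-metrics and share a node label.
kube_node_status_condition{condition="Ready", status="true"} == 0
  unless on (node)
kube_node_spec_unschedulable == 1
```

The trade-off is that a cordoned node that fails for an unrelated reason is suppressed too, so time-boxed silences in Alertmanager remain the safer choice for long maintenance windows.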
Interview Questions
Beginner
CPU usage by pod or instance, request rate, latency, and node saturation to see whether the spike is localized or broad.
If memory steadily increases without stabilizing and leads to restarts or pressure, it may indicate a leak or runaway cache.
Crashes, OOM kills, configuration errors, or failing dependencies.
Latency often reflects user experience directly and helps distinguish harmless resource movement from real impact.
up is a good first check to see whether critical targets are reachable, followed by request and error rate queries.
Intermediate
If only the new version shows bad metrics while cluster-wide capacity is otherwise fine, it points to a deployment issue rather than pure scaling need.
Working set memory, restart count, OOM events, GC behavior if available, and request volume to rule out traffic-driven growth.
Because many incidents are introduced by releases, and correlation helps isolate recent changes quickly.
Pods and nodes can look healthy while the application returns errors or degraded responses. App metrics are still necessary.
It groups metrics by incident workflow, such as traffic, errors, saturation, and rollout state, instead of random technical categories.
Scenario-based
I check error rate, request queueing, database latency, thread pool saturation, and downstream dependency metrics. CPU is not the whole story.
I inspect ServiceMonitor or PodMonitor selectors, metrics port naming, namespace selectors, RBAC, and whether the app exposes /metrics.
I compare metrics by version label, deployment timeline, and pod set. If degradation starts with new replicas only, rollout correlation is strong.
Add maintenance silences, environment-aware rules, and suppression for known operational workflows like drains or rolling upgrades.
I check whether the workload is stateless and horizontally scalable, whether one hot shard exists, and whether the bottleneck is actually CPU rather than a dependency.
Summary
Real-world Prometheus work is about correlation, not isolated metrics. CPU spikes, memory growth, and missing Kubernetes targets become manageable when you combine workload, platform, and rollout context.