Real-world Dashboards
Build practical dashboards for CPU, memory, and request monitoring with Prometheus-backed Grafana panels.
Simple Explanation (ELI5)
A real dashboard should tell you immediately if users are affected and where to investigate first.
Technical Explanation
Production dashboards should combine the RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) viewpoints. For application services, track request rate, error rate, and latency. For infrastructure, track CPU, memory, node pressure, and restarts.
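As a rough illustration of combining the two viewpoints, the sketch below assembles a Grafana-style dashboard model with a RED row on top and a resource row underneath. The PromQL strings are the ones used in this lesson, except the 5xx query, which assumes that http_requests_total carries a status label. The dashboard title, panel titles, and helper names are hypothetical, and the JSON is trimmed to a few common fields rather than a complete export.

```python
import json

# RED panels: request rate, error ratio, latency.
RED_PANELS = [
    ("Request rate (RPS)", 'sum by (service) (rate(http_requests_total[5m]))'),
    # Assumption: http_requests_total has a `status` label holding the HTTP code.
    ("5xx %",
     '100 * sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))'
     ' / sum by (service) (rate(http_requests_total[5m]))'),
    ("p95 latency",
     'histogram_quantile(0.95, sum by (le, service) '
     '(rate(http_request_duration_seconds_bucket[5m])))'),
]

# Resource (USE-style) panels: CPU and memory by namespace.
USE_PANELS = [
    ("CPU by namespace",
     'sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'),
    ("Memory by namespace",
     'sum by (namespace) (container_memory_working_set_bytes{container!=""})'),
]


def build_row(panels, y, first_id):
    """Lay the panels side by side across Grafana's 24-column grid at height y."""
    width = 24 // len(panels)
    return [
        {
            "id": first_id + i,
            "type": "timeseries",
            "title": title,
            "gridPos": {"h": 8, "w": width, "x": i * width, "y": y},
            "targets": [{"refId": "A", "expr": expr}],
        }
        for i, (title, expr) in enumerate(panels)
    ]


def build_dashboard():
    """RED row first (y=0), resource row below it (y=8)."""
    panels = build_row(RED_PANELS, y=0, first_id=1)
    panels += build_row(USE_PANELS, y=8, first_id=len(RED_PANELS) + 1)
    return {"title": "Service health (sketch)", "refresh": "30s", "panels": panels}


if __name__ == "__main__":
    print(json.dumps(build_dashboard(), indent=2))
```

Keeping the user-facing RED panels in the first row means the question "are users affected?" is answered before anyone scrolls to the resource panels.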
Visual Section
CPU panel: node and pod CPU trend, plus top consumers
Memory panel: working set, OOM kills/restarts, node pressure
Requests panel: RPS, 5xx %, p95 latency
Hands-on Commands
# CPU by namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
# Memory by namespace
sum by (namespace) (container_memory_working_set_bytes{container!=""})
# Request rate by service
sum by (service) (rate(http_requests_total[5m]))
# p95 latency (histogram)
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
Debugging Scenarios
- CPU panel always zero: the query selects the metric from the wrong job or cluster.
- Memory panel shows huge jumps: unit mismatch between bytes and MiB.
- Request panel undercounts: the query filters on status or method labels incorrectly.
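The unit-mismatch case is easy to reproduce on paper. The sketch below uses made-up sample values and a hypothetical to_bytes helper to show that the same amount of memory differs by a factor of 1,048,576 depending on whether it is expressed in MiB or bytes. If one series on a panel arrives in MiB while the panel unit is set to bytes, the graph will appear to jump by that factor.

```python
MIB = 1024 * 1024

# Multipliers for binary units; the helper name and table are illustrative only.
FACTORS = {"bytes": 1, "KiB": 1024, "MiB": MIB, "GiB": 1024 * MIB}


def to_bytes(value, unit):
    """Convert a value in the given binary unit to bytes."""
    return value * FACTORS[unit]


# Hypothetical samples: the same working set reported by two different sources.
samples = [(512, "MiB"), (536870912, "bytes")]

# Normalise everything to bytes before comparing or plotting.
normalised = [to_bytes(value, unit) for value, unit in samples]

if __name__ == "__main__":
    print(normalised)               # both entries are the same number of bytes
    print(samples[1][0] // samples[0][0])  # raw values differ by the MiB factor
```

In practice it is usually simpler to keep the query in bytes, as container_memory_working_set_bytes already is, and let the panel's unit setting handle display scaling.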
Real-world Use Case
During a traffic surge, the request dashboard showed normal RPS, but p95 latency had doubled and CPU usage was concentrated in one namespace. The team scaled that specific workload rather than the whole cluster.
Interview Questions
Beginner
Request rate, error rate, and latency.
To correlate app symptoms with resource pressure.
95% of requests are faster than this value.
Shows which pod/service drives resource spikes.
Rate of request counter over time window.
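To make the p95 answer above concrete, here is a simplified re-implementation of how a quantile can be estimated from cumulative histogram buckets: find the bucket that contains the target rank, then interpolate linearly inside it. This is a sketch of the general idea behind histogram_quantile for classic histograms, not Prometheus's actual code, and the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): avoid dividing by zero.
                return bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical request-duration buckets, in seconds.
    buckets = [(0.1, 40), (0.25, 70), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, buckets))  # roughly 0.75 seconds
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a p95 that lands in a wide bucket is only known to lie somewhere inside that bucket.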
Intermediate
Top rows for RED, lower rows for USE/resource drilldowns.
Enables ownership and blast-radius scoping in incidents.
Use quantiles/histograms and segmented views.
15-30s for critical dashboards, slower for broad overviews.
Fast visual correlation between release events and regressions.
Scenario-based
Dependency/database and error dashboards for downstream bottlenecks.
Leak or cache issue, not traffic-driven scaling.
Add region variable and per-region panel breakdown.
Yes, verify scrape/label path immediately because blind spots hide failures.
It supports incident triage quickly and consistently across on-call rotations.
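The region breakdown mentioned in the scenario answers could look roughly like the sketch below: a dashboard variable populated from label values, plus a request-rate query filtered and grouped by that variable. It assumes that http_requests_total carries a region label, which this lesson does not confirm. The helper names are hypothetical, and the variable definition is reduced to a few commonly used fields.

```python
def region_variable(metric="http_requests_total", label="region"):
    """Dashboard variable definition; assumes `label` exists on `metric`."""
    return {
        "name": label,
        "type": "query",
        # label_values() is the Grafana Prometheus data source variable function.
        "query": f"label_values({metric}, {label})",
        "includeAll": True,
        "multi": True,
    }


def per_region_expr(variable="region"):
    """Request rate by service and region, filtered by the dashboard variable."""
    return (
        f"sum by (service, {variable}) "
        f'(rate(http_requests_total{{{variable}=~"${variable}"}}[5m]))'
    )


if __name__ == "__main__":
    print(region_variable())
    print(per_region_expr())
```

The regex matcher (=~) is used so that the query keeps working when the variable is set to several regions or to "All".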
Summary
Real-world dashboards should be decision tools: CPU, memory, and request health views tied directly to incident workflows.