Advanced · Lesson 7 of 9

Real-world Dashboards

Build practical dashboards for CPU, memory, and request monitoring with Prometheus-backed Grafana panels.

Simple Explanation (ELI5)

A real dashboard should tell you immediately if users are affected and where to investigate first.

Technical Explanation

Production dashboards should combine RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) viewpoints. For app services: request rate, error rate, latency. For infrastructure: CPU, memory, node pressure, restarts.
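As a rough illustration of the USE side, the queries below sketch one utilization, one saturation, and one error signal. They assume node_exporter, cAdvisor, and kube-state-metrics are being scraped with their default metric names; adjust labels to match your setup.

```promql
# Utilization: fraction of CPU time each node spends non-idle (node_exporter)
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Saturation: share of CFS periods in which a pod's containers were throttled (cAdvisor)
sum by (namespace, pod) (rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
/
sum by (namespace, pod) (rate(container_cpu_cfs_periods_total{container!=""}[5m]))

# Errors: nodes currently reporting memory pressure (kube-state-metrics)
kube_node_status_condition{condition="MemoryPressure", status="true"} == 1
```

A throttling ratio that stays high while utilization looks moderate usually points at CPU limits rather than node capacity.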

Visual Section

CPU Dashboard

Node and pod CPU trend + top consumers

Memory Dashboard

Working set, OOM/restarts, node pressure

Requests Dashboard

RPS, 5xx %, p95 latency

Hands-on Commands

promql
# CPU by namespace
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Memory by namespace
sum by (namespace) (container_memory_working_set_bytes{container!=""})

# Request rate by service
sum by (service) (rate(http_requests_total[5m]))

# p95 latency (histogram)
histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
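The panels listed in the Visual Section also call for a 5xx percentage, top consumers, and restarts. A possible sketch follows. It assumes the HTTP status code is exposed in a label named `status` (some exporters call it `code`) and that kube-state-metrics is installed for the restart counter.

```promql
# 5xx error percentage by service
# Assumption: status code is in a label named `status`
100 * sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

# Top 5 pods by CPU usage
topk(5, sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))

# Pods whose containers restarted in the last hour (kube-state-metrics)
sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[1h])) > 0
```

Note that the error percentage may return no series for a service that has never emitted a 5xx response, so an empty panel there is not necessarily a scrape problem.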

Debugging Scenarios

Real-world Use Case

During a traffic surge, the request dashboard showed normal RPS, but p95 latency had doubled and CPU usage was concentrated in one namespace. The team scaled that specific workload instead of the whole cluster.
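A drilldown like the one in this use case could be sketched as follows. The namespace name `checkout` is hypothetical and stands in for whichever namespace the first query highlights.

```promql
# CPU share per namespace as a percentage of the cluster total
100 * sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/ scalar(sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])))

# Drill into the hot namespace to find the pods to scale
# `checkout` is a hypothetical namespace name
topk(10, sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="checkout", container!=""}[5m])))
```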

Interview Questions

Beginner

Which 3 panels are must-have for APIs?

Request rate, error rate, and latency.

Why include CPU and memory too?

To correlate app symptoms with resource pressure.

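What is p95 latency?

The latency that 95% of requests complete at or below; the slowest 5% take longer.

Why is a top-consumer panel useful?

It shows which pod or service is driving a resource spike.

Which metric shows request volume?

The rate of a request counter over a time window, for example rate(http_requests_total[5m]).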

Intermediate

How do you combine RED and USE in one dashboard?

Put RED panels in the top rows and USE/resource drilldowns in the lower rows.

Why query by namespace and service labels?

Enables ownership and blast-radius scoping in incidents.

How do you avoid misleading averages?

Use quantiles/histograms and segmented views.

How do you tune the refresh interval for real-time ops?

15-30s for critical dashboards, slower for broad overviews.

Why add annotations for deploys?

Fast visual correlation between release events and regressions.
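To make the point about misleading averages concrete, the sketch below puts mean latency next to quantiles from the same histogram. It relies on the `_sum` and `_count` series that Prometheus histograms expose alongside `_bucket`.

```promql
# Mean latency by service: total time spent divided by number of requests
sum by (service) (rate(http_request_duration_seconds_sum[5m]))
/
sum by (service) (rate(http_request_duration_seconds_count[5m]))

# Median and tail latency from the same histogram
histogram_quantile(0.50, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
```

Plotting these on one panel tends to show the mean sitting close to the median while p99 moves on its own, which is the tail behavior an average hides.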

Scenario-based

Latency is high but CPU is normal. Which dashboard do you check next?

Dependency/database and error dashboards for downstream bottlenecks.

Memory grows but requests are stable. What do you suspect?

A memory leak or a cache issue, since the growth is not driven by traffic.

Request spikes appear in only one region. How do you show that?

Add region variable and per-region panel breakdown.

One panel shows "No data" during an outage. Is that critical?

Yes, verify scrape/label path immediately because blind spots hide failures.

How do you decide a dashboard is production-ready?

It supports incident triage quickly and consistently across on-call rotations.
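Several of the scenarios above map to short queries. The following is a sketch under stated assumptions: requests carry a `region` label, the dashboard defines a Grafana variable named `region`, and `http_requests_total` is the request counter used earlier in this lesson.

```promql
# Memory growing independently of traffic: per-pod slope of the working set, in bytes per second
sum by (namespace, pod) (deriv(container_memory_working_set_bytes{container!=""}[1h])) > 0

# Per-region request rate
# Assumption: a `region` label exists; $region is a Grafana dashboard variable, not PromQL
sum by (region) (rate(http_requests_total{region=~"$region"}[5m]))

# "No data" checks: targets that are failing scrapes
up == 0

# "No data" checks: returns 1 when the request metric has no series at all
absent(http_requests_total)
```

Comparing the memory slope with the request rate panel helps separate a leak from ordinary load-driven growth.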

Summary

Real-world dashboards should be decision tools: CPU, memory, and request health views tied directly to incident workflows.