Dashboards
Design maintainable Grafana dashboards that support fast diagnosis and team collaboration.
Simple Explanation (ELI5)
A dashboard is a page of charts that tells the story of system health at a glance.
Technical Explanation
Good dashboard design follows a hierarchy: a service overview at the top, saturation and dependency panels below it, then component details. Use variables for environment, namespace, and service so one dashboard can serve many targets. Set units and thresholds explicitly on every panel rather than relying on defaults.
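As a rough sketch, the variables and explicit units described above might look like this in the dashboard JSON model. It assumes a Prometheus datasource; the metric names (`kube_pod_info`, `http_requests_total`, `container_memory_working_set_bytes`), the `service` label, and the 1 GiB threshold are illustrative assumptions, not values from a real system. How the `environment` variable is consumed (datasource selection or a label filter) depends on your setup and is left out here.

```json
{
  "templating": {
    "list": [
      {
        "name": "environment",
        "label": "Environment",
        "type": "custom",
        "query": "prod,staging"
      },
      {
        "name": "namespace",
        "label": "Namespace",
        "type": "query",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 1
      },
      {
        "name": "service",
        "label": "Service",
        "type": "query",
        "query": "label_values(http_requests_total{namespace=\"$namespace\"}, service)",
        "refresh": 1
      }
    ]
  },
  "panels": [
    {
      "type": "timeseries",
      "title": "Pod memory (working set)",
      "description": "Illustrative panel: the metric name and threshold value are assumptions.",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
      "targets": [
        {
          "refId": "A",
          "expr": "sum by (pod) (container_memory_working_set_bytes{namespace=\"$namespace\"})"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "bytes",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1073741824 }
            ]
          }
        },
        "overrides": []
      }
    }
  ]
}
```

Setting `unit` and `thresholds` under `fieldConfig.defaults` keeps the meaning of each panel visible in the JSON itself, which makes review diffs easier to read.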
Visual Section
Dashboard layout, top to bottom: SLI summary panels; latency, errors, traffic; CPU, memory, pod details.
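One way to express this layering is with row panels, sketched below. Queries are omitted to keep the focus on layout, and the panel titles inside each row are placeholders chosen for illustration. The last row is collapsed, so its child panel is nested in the row's own `panels` array.

```json
{
  "panels": [
    {
      "type": "row",
      "title": "SLI summary",
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 },
      "panels": []
    },
    {
      "type": "stat",
      "title": "SLI summary (placeholder)",
      "gridPos": { "h": 4, "w": 24, "x": 0, "y": 1 }
    },
    {
      "type": "row",
      "title": "Latency, errors, traffic",
      "collapsed": false,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 5 },
      "panels": []
    },
    {
      "type": "timeseries",
      "title": "Request rate (placeholder)",
      "gridPos": { "h": 8, "w": 24, "x": 0, "y": 6 }
    },
    {
      "type": "row",
      "title": "CPU, memory, pod details",
      "collapsed": true,
      "gridPos": { "h": 1, "w": 24, "x": 0, "y": 14 },
      "panels": [
        {
          "type": "timeseries",
          "title": "CPU by pod (placeholder)",
          "gridPos": { "h": 8, "w": 24, "x": 0, "y": 15 }
        }
      ]
    }
  ]
}
```

Collapsing the detail row by default keeps the first screen limited to the high-signal panels while leaving the drilldown one click away.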
Hands-on Commands
{
  "title": "Checkout Service",
  "tags": ["prod", "api"],
  "timezone": "browser",
  "refresh": "30s"
}
Debugging Scenarios
- Panels show mixed units: set an explicit unit on each panel.
- Dashboard is unreadable during on-call: reduce the number of panels and put SLIs first.
- Wrong environment shown: the variable defaults are misconfigured; check the default value of the environment variable.
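For the wrong-environment case, a saved default can be checked in the variable definition itself. The fragment below is a sketch of a custom variable whose default is `prod`; the variable name and the `prod`/`staging` values are assumptions for illustration.

```json
{
  "templating": {
    "list": [
      {
        "name": "environment",
        "label": "Environment",
        "type": "custom",
        "query": "prod,staging",
        "current": { "selected": true, "text": "prod", "value": "prod" },
        "options": [
          { "selected": true, "text": "prod", "value": "prod" },
          { "selected": false, "text": "staging", "value": "staging" }
        ]
      }
    ]
  }
}
```

If `current` points at an unexpected value, the dashboard was probably saved with that selection active. A value passed in the dashboard URL can also override the saved default, so it is worth checking shared links as well.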
Real-world Use Case
A service dashboard with request rate, error rate, and p95 latency reduced incident triage time from 20 minutes to 5 minutes.
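A dashboard like the one described could be built from three panels along these lines. This is a sketch only: it assumes a Prometheus datasource, a `service` dashboard variable, and conventional metric names (`http_requests_total` with a `status` label, `http_request_duration_seconds_bucket`) that may differ in your instrumentation.

```json
{
  "panels": [
    {
      "type": "timeseries",
      "title": "Request rate",
      "description": "Illustrative: metric and label names are assumptions.",
      "gridPos": { "h": 8, "w": 8, "x": 0, "y": 0 },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(http_requests_total{service=\"$service\"}[5m]))"
        }
      ],
      "fieldConfig": { "defaults": { "unit": "reqps" }, "overrides": [] }
    },
    {
      "type": "timeseries",
      "title": "Error rate",
      "description": "Illustrative: assumes 5xx responses are tagged in a status label.",
      "gridPos": { "h": 8, "w": 8, "x": 8, "y": 0 },
      "targets": [
        {
          "refId": "A",
          "expr": "sum(rate(http_requests_total{service=\"$service\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{service=\"$service\"}[5m]))"
        }
      ],
      "fieldConfig": { "defaults": { "unit": "percentunit" }, "overrides": [] }
    },
    {
      "type": "timeseries",
      "title": "p95 latency",
      "description": "Illustrative: assumes a Prometheus histogram measured in seconds.",
      "gridPos": { "h": 8, "w": 8, "x": 16, "y": 0 },
      "targets": [
        {
          "refId": "A",
          "expr": "histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{service=\"$service\"}[5m])))"
        }
      ],
      "fieldConfig": { "defaults": { "unit": "s" }, "overrides": [] }
    }
  ]
}
```

Placing the three panels side by side on one row (each 8 of 24 grid columns wide) lets the on-call engineer compare traffic, errors, and latency over the same time window without scrolling.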
Interview Questions
Beginner
Clear layout, useful metrics, consistent units, and fast insight.
To reuse one dashboard across environments/services.
Service health summary metrics.
Based on need; 15-60s for ops is common.
They dilute signal and slow diagnosis.
Intermediate
Exec summary, service owner detail, and on-call drilldown dashboards.
Use recording rules, fewer panels, and sane refresh intervals.
Mark deploys/incidents to correlate metric changes with events.
Use templates, folders, naming standards, and reviews.
When one page becomes overloaded or crosses ownership boundaries.
Scenario-based
Cut expensive panels, shorten range defaults, add recording rules.
Variable or label filter mismatch.
Datasource separation and strict environment variables.
Focus on high-signal metrics and remove clutter.
Track MTTD/MTTR and on-call feedback.
Summary
Dashboards are operational products. Design them for fast decisions, not visual decoration.