Troubleshooting
Step-by-step diagnosis when Grafana dashboards show no data, bad values, or noisy alerts.
Simple Explanation (ELI5)
If charts look wrong, check the path from metric source to panel: the app emits metrics, Prometheus scrapes them, and Grafana queries Prometheus.
Technical Explanation
Most issues are caused by label mismatches, wrong time ranges, stale targets, query complexity, or datasource permissions. Run the same query in Grafana Explore and in the Prometheus expression browser to see quickly whether the problem sits in Grafana or in Prometheus.
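The label-mismatch case is the easiest one to check mechanically. The following is a minimal sketch, not a definitive tool: it assumes you have already fetched the label sets for a metric (for example from the Prometheus `/api/v1/series` endpoint) and want to know which labels a dashboard query relies on that no series actually carries. The function name and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Set


def labels_missing_from_series(
    series: Iterable[Dict[str, str]], required: Iterable[str]
) -> Set[str]:
    """Return the required label names that appear on none of the series."""
    seen: Set[str] = set()
    for label_set in series:
        seen.update(label_set.keys())
    return set(required) - seen


# Illustrative label sets, shaped like the "data" list returned by
# /api/v1/series. The values are made up for this example.
sample_series = [
    {"__name__": "http_requests_total", "service": "checkout",
     "namespace": "shop", "code": "200"},
    {"__name__": "http_requests_total", "service": "checkout",
     "namespace": "shop", "code": "500"},
]

# The dashboard query groups by "status", but these series expose "code".
missing = labels_missing_from_series(
    sample_series, ["service", "namespace", "status"]
)
print(sorted(missing))  # prints ['status']
```

An empty result means every label the query uses exists on at least one series, which rules out this class of problem without proving the query is otherwise correct.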
Troubleshooting Checklist
- Datasource health: Grafana connection test.
- Prometheus target health: /targets up and recently scraped.
- Time range sanity: panel range vs retention.
- Label verification: exact metric labels in Prometheus.
- Query cost: filter out high-cardinality labels and shorten range-vector windows.
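The target-health item in the checklist above can be sketched in code. This example assumes the response body of the Prometheus `/api/v1/targets` endpoint has already been parsed into a dictionary, and it relies on the `activeTargets`, `health`, `lastScrape`, `scrapeUrl`, and `labels` fields of that response. The function names, the two-minute staleness threshold, and the sample payload are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def parse_rfc3339(timestamp: str) -> datetime:
    """Parse an RFC 3339 timestamp, normalizing the fraction to microseconds."""
    normalized = timestamp.replace("Z", "+00:00")
    # Pad or truncate fractional seconds to exactly six digits so that
    # datetime.fromisoformat accepts them on older Python versions too.
    normalized = re.sub(
        r"\.(\d+)",
        lambda m: "." + m.group(1)[:6].ljust(6, "0"),
        normalized,
        count=1,
    )
    return datetime.fromisoformat(normalized)


def unhealthy_targets(
    payload: Dict[str, Any], now: datetime, max_age: timedelta
) -> List[Dict[str, str]]:
    """List targets that are not up, or whose last scrape is older than max_age."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        job = target.get("labels", {}).get("job", "unknown")
        url = target.get("scrapeUrl", "")
        if target.get("health") != "up":
            problems.append({"job": job, "scrapeUrl": url, "reason": "down"})
        elif now - parse_rfc3339(target["lastScrape"]) > max_age:
            problems.append({"job": job, "scrapeUrl": url, "reason": "stale"})
    return problems


# Made-up payload in the shape of a /api/v1/targets response.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "kubernetes-pods"}, "health": "up",
             "scrapeUrl": "http://10.0.0.1:8080/metrics",
             "lastScrape": "2024-05-01T12:04:45.123456789Z"},
            {"labels": {"job": "kubernetes-pods"}, "health": "down",
             "scrapeUrl": "http://10.0.0.2:8080/metrics",
             "lastScrape": "2024-05-01T12:04:50Z"},
            {"labels": {"job": "node-exporter"}, "health": "up",
             "scrapeUrl": "http://10.0.0.3:9100/metrics",
             "lastScrape": "2024-05-01T11:50:00Z"},
        ]
    },
}

now = datetime(2024, 5, 1, 12, 5, 0, tzinfo=timezone.utc)
for problem in unhealthy_targets(sample_payload, now, timedelta(minutes=2)):
    print(problem["reason"], problem["job"], problem["scrapeUrl"])
```

In practice `now` would be the current UTC time, and `max_age` should be a small multiple of the configured scrape interval so that one slow scrape is not reported as stale.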
Hands-on Commands
# Check scrape targets
kubectl port-forward svc/prometheus-server 9090:80
# Open /targets and confirm all critical jobs are UP
# Compare raw metric in Prometheus expression browser
up{job="kubernetes-pods"}
# Validate request metric labels before dashboard query
sum by (service, namespace, status) (rate(http_requests_total[5m]))
Common Incident Patterns
- No Data: wrong datasource, expired token, or misspelled metric.
- Spiky Graphs: scrape interval and rate() window mismatch (a window that covers too few scrapes exaggerates spikes or leaves gaps).
- Alert Storm: no `for` duration on alert rules and poor grouping in the notification policy.
- Slow Panels: high-cardinality labels (pod UID, request path) in query.
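For the slow-panel pattern above, it helps to see which labels contribute the most distinct values. The sketch below assumes the label sets of the affected metric are already available as a list of dictionaries (for example from `/api/v1/series`). The function name and the generated sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def label_cardinality(series: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count distinct values per label name, highest count first."""
    values: Dict[str, Set[str]] = defaultdict(set)
    for label_set in series:
        for name, value in label_set.items():
            values[name].add(value)
    # Sort by descending count, then by label name for a stable order.
    return sorted(
        ((name, len(distinct)) for name, distinct in values.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Made-up series where the request path embeds an order ID, so every
# request path becomes its own time series.
sample_series = [
    {"__name__": "http_requests_total", "service": "checkout",
     "path": f"/orders/{order_id}"}
    for order_id in range(50)
]

print(label_cardinality(sample_series)[0])  # prints ('path', 50)
```

Labels at the top of this ranking are the first candidates to aggregate away in the panel query or to drop at scrape time.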
Real-world Use Case
An on-call team saw blank latency panels after a migration. The root cause was a renamed metric in the app instrumentation. The fix included backward-compatible recording rules and updated dashboard queries.
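The dashboard half of that fix can be sketched as a small script. This is an illustration under stated assumptions, not the team's actual tooling: it assumes the dashboard is exported as JSON with Prometheus queries stored in `panels[].targets[].expr`, that row panels may nest further `panels`, and that the old and new metric names shown are hypothetical. It matches the metric name only as a whole identifier and does not parse PromQL, so a label or quoted string that happens to equal the old name would also be rewritten.

```python
import copy
import re
from typing import Any, Dict, List


def rename_metric(dashboard: Dict[str, Any], old: str, new: str) -> Dict[str, Any]:
    """Return a copy of the dashboard with `old` replaced by `new` in each expr."""
    # Match the name only when it is not part of a longer identifier.
    pattern = re.compile(
        r"(?<![A-Za-z0-9_:])" + re.escape(old) + r"(?![A-Za-z0-9_:])"
    )

    def visit(panels: List[Dict[str, Any]]) -> None:
        for panel in panels:
            for target in panel.get("targets", []):
                if "expr" in target:
                    target["expr"] = pattern.sub(lambda _m: new, target["expr"])
            visit(panel.get("panels", []))  # panels nested inside rows

    updated = copy.deepcopy(dashboard)  # leave the caller's dashboard untouched
    visit(updated.get("panels", []))
    return updated


# Made-up dashboard with one top-level panel and one panel inside a row.
sample_dashboard = {
    "title": "Service overview",
    "panels": [
        {"title": "Request rate",
         "targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]},
        {"type": "row", "title": "Details", "panels": [
            {"title": "Errors",
             "targets": [{"expr": 'http_requests_total{status="500"}'}]},
        ]},
    ],
}

updated = rename_metric(
    sample_dashboard, "http_requests_total", "http_server_requests_total"
)
print(updated["panels"][0]["targets"][0]["expr"])
# prints sum(rate(http_server_requests_total[5m]))
```

Running a script like this as part of the release that renames the metric keeps dashboards and instrumentation in step, while recording rules cover consumers that still query the old name.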
Interview Questions
Beginner
Datasource connection and raw metric existence in Prometheus.
Metrics may exist outside selected window.
Query filters on labels not present in metric series.
Thresholds too tight and no debounce duration.
Grafana Explore and Prometheus expression browser.
Intermediate
Use recording rules and avoid high-cardinality dimensions.
No `for` duration and unstable baseline around threshold.
Compare label sets before/after deployment and annotation timestamps.
Check scrape config, relabel rules, and namespace filters.
To inspect raw labels/series before aggregation.
Scenario-based
Compare alert query, panel query, and evaluation interval consistency.
Datasource routing, cluster label, and target health for that cluster.
Series cardinality explodes with traffic dimensions.
Metric rename/instrumentation changes in new version.
Create dashboard tests/checklists and version dashboards with releases.
Summary
Reliable troubleshooting comes from systematic checks across datasource, scrape pipeline, labels, and query performance.