Hands-on, Lesson 8 of 9

Troubleshooting

Step-by-step diagnosis when Grafana dashboards show no data, bad values, or noisy alerts.

Simple Explanation (ELI5)

If charts look wrong, trace the path from metric source to panel: the app emits metrics, Prometheus scrapes them, and Grafana queries Prometheus. A break at any step shows up as a wrong or empty chart.

Technical Explanation

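Most issues come from label mismatches, wrong time ranges, stale or down scrape targets, expensive queries, or datasource permissions. Run the same query in Grafana's Explore mode and in the Prometheus expression browser and compare the results. If Prometheus itself returns nothing, the problem is upstream of Grafana (scraping or instrumentation); if only Grafana is empty, look at the datasource, the time range, or the panel settings.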

Troubleshooting Checklist

Hands-on Commands

bash/promql
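# Check scrape targets: forward the Prometheus UI to localhost:9090
# (service name and port depend on your install; add -n <namespace> if needed)
kubectl port-forward svc/prometheus-server 9090:80
# Open http://localhost:9090/targets and confirm all critical jobs are UP

# PromQL: compare the raw metric in the Prometheus expression browser (1 = up, 0 = down)
up{job="kubernetes-pods"}

# PromQL: validate request metric labels before using them in a dashboard query
sum by (service, namespace, status) (rate(http_requests_total[5m]))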
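
The /targets check can also be scripted against the Prometheus HTTP API. The following is a minimal sketch, assuming Prometheus is reachable at http://localhost:9090 through the port-forward above. The function names, the sample payload, and the addresses in it are made up for illustration; only the `/api/v1/targets` endpoint and its `activeTargets`, `health`, `scrapeUrl`, and `lastError` fields come from Prometheus itself.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for each active target whose health is not "up"."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("job", "?"), t.get("scrapeUrl", "?"), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch live target state. Assumes the port-forward is running; not called below."""
    with urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)


# Made-up sample in the shape of an /api/v1/targets response (trimmed to the fields used).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "kubernetes-nodes"}, "scrapeUrl": "http://10.0.0.5:9100/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "kubernetes-pods"}, "scrapeUrl": "http://10.0.0.7:8080/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for job, url, err in unhealthy_targets(SAMPLE):
    print(f"{job}: {url} is not up ({err})")
```

Against a live server, `unhealthy_targets(fetch_targets())` would run the same filter on real data.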

Common Incident Patterns

Real-world Use Case

An on-call team saw blank latency panels after a migration. The root cause was a renamed metric in the app instrumentation. The fix combined backward-compatible recording rules, which kept the data available under the old name, with updated dashboard queries.
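
The dashboard-query half of such a fix can be scripted over an exported dashboard JSON. This is only a sketch under assumptions: the metric names are hypothetical, `rename_metric` is an illustrative helper rather than a Grafana API, and it relies on the dashboard JSON layout in which panels carry a `targets` list with an `expr` string (and row panels may nest further `panels`).

```python
import re


def rename_metric(dashboard, old, new):
    """Replace metric name `old` with `new` in every target expression.

    Returns the number of expressions that changed. Matching is by whole
    identifier, so a longer name that merely contains `old` is left alone.
    """
    pattern = re.compile(rf"(?<![A-Za-z0-9_:]){re.escape(old)}(?![A-Za-z0-9_:])")
    changed = 0
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        stack.extend(panel.get("panels", []))  # row panels can nest their own panels
        for target in panel.get("targets", []):
            expr = target.get("expr")
            if not isinstance(expr, str):
                continue
            updated = pattern.sub(lambda _m: new, expr)
            if updated != expr:
                target["expr"] = updated
                changed += 1
    return changed


# Hypothetical dashboard fragment and metric names.
dashboard = {
    "panels": [
        {"title": "Latency p95", "targets": [
            {"expr": "histogram_quantile(0.95, sum by (le) (rate(app_latency_seconds_bucket[5m])))"}
        ]},
        {"type": "row", "panels": [
            {"title": "Requests", "targets": [{"expr": "sum(rate(http_requests_total[5m]))"}]}
        ]},
    ]
}

count = rename_metric(dashboard, "app_latency_seconds_bucket", "app_request_duration_seconds_bucket")
print(count, dashboard["panels"][0]["targets"][0]["expr"])
```

The regex does not parse PromQL, so a label name or string literal that happens to equal the old metric name would also be rewritten; review the diff before importing the dashboard again.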

Interview Questions

Beginner

What is the first check when a panel shows no data?

The datasource connection, and whether the raw metric exists in Prometheus.

Why does time range matter?

The data may exist only outside the selected window, so the panel looks empty.

What is a label mismatch?

The query filters on labels (or label values) that the metric series do not have.
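
To make a label mismatch concrete, the sketch below compares a query's equality matchers against the labels of one series. It is a simplified illustration, not how Prometheus evaluates selectors internally: it handles only `label="value"` matchers, and the function name, metric, and label values are hypothetical.

```python
def mismatched_matchers(matchers, series_labels):
    """Return the equality matchers this series cannot satisfy, with a reason for each."""
    problems = {}
    for name, wanted in matchers.items():
        # PromQL treats a missing label the same as an empty value.
        actual = series_labels.get(name, "")
        if actual == wanted:
            continue
        if name not in series_labels:
            problems[name] = "label not present on series"
        else:
            problems[name] = f"series has {actual!r}, query wants {wanted!r}"
    return problems


# Query: http_requests_total{service="checkout", status="500"}
query_matchers = {"service": "checkout", "status": "500"}

# The series actually exposes `app`, not `service` (hypothetical labels).
series = {"app": "checkout", "namespace": "shop", "status": "500"}

result = mismatched_matchers(query_matchers, series)
print(result)
```

Here the panel would be empty because the query asks for `service` while the instrumentation emits `app`.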

Why can alerts be noisy?

Thresholds are too tight and there is no debounce (`for`) duration.

Where can you validate a query quickly?

Grafana Explore and Prometheus expression browser.

Intermediate

How do you reduce panel query latency?

Use recording rules and avoid high-cardinality dimensions.
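
One way to spot a high-cardinality dimension is to count distinct values per label across the series of a metric. The sketch below does that for a list of label sets, such as the list the Prometheus `/api/v1/series` endpoint returns in its `data` field. The function name and sample series are made up for illustration.

```python
from collections import defaultdict


def label_cardinality(series):
    """Count distinct values per label across label sets, highest count first."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return sorted(
        ((name, len(vals)) for name, vals in values.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical series: `path` embeds a user id, so it grows with traffic.
series = [
    {"service": "api", "status": "200", "path": "/users/1"},
    {"service": "api", "status": "200", "path": "/users/2"},
    {"service": "api", "status": "500", "path": "/users/3"},
    {"service": "web", "status": "200", "path": "/users/4"},
]

ranking = label_cardinality(series)
print(ranking)
```

A label whose count keeps growing with traffic, like `path` here, is a candidate to drop from the instrumentation or to aggregate away in a recording rule.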

Why can alert rules flap?

There is no `for` duration and the value hovers around the threshold.
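
The effect of a `for` duration can be illustrated with a small simulation. This is a simplified model, not Prometheus's exact state machine: `hold_evals` is an assumed stand-in for the `for` duration expressed in evaluation cycles, and the CPU samples and threshold are invented.

```python
def firing_states(samples, threshold, hold_evals=1):
    """Per evaluation, True once value > threshold has held for hold_evals consecutive evaluations."""
    states = []
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        states.append(streak >= hold_evals)
    return states


def transitions(states):
    """Count changes between firing and not firing (flaps)."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Invented CPU percentages hovering around an 80% threshold.
cpu = [78, 82, 79, 81, 78, 83, 85, 86, 84, 79]

no_debounce = transitions(firing_states(cpu, 80, hold_evals=1))
with_debounce = transitions(firing_states(cpu, 80, hold_evals=3))
print(no_debounce, with_debounce)
```

With these samples the undebounced rule changes state six times, while requiring three consecutive breaches leaves a single fire-and-resolve pair.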

How do you debug mismatched dashboards after a deploy?

Compare the label sets from before and after the deployment, and line up deploy annotation timestamps with the point where the data changes.

How do you troubleshoot missing namespace data?

Check the scrape config, relabel rules, and namespace filters.

When do you use a table panel for debugging?

To inspect raw labels/series before aggregation.

Scenario-based

CPU alert fired but graph looks normal. What now?

Compare the alert query with the panel query, and check that the evaluation interval and the query window are consistent between them.

Only one cluster has no data. What do you check?

Datasource routing, the cluster label, and target health for that cluster.

Panel loads slowly only during peak traffic. Why?

Series cardinality grows with traffic-driven label values, so each query has to read more series at peak.

The request panel dropped after an app release. What do you suspect?

A metric rename or other instrumentation change in the new version.

How do you prevent repeat incidents?

Create dashboard tests and checklists, and version dashboards together with releases.

Summary

Reliable troubleshooting comes from systematic checks across datasource, scrape pipeline, labels, and query performance.