Troubleshooting
Debug the most common Prometheus failures: metrics not collected, targets down, empty graphs, noisy alerts, and broken Kubernetes scraping.
Simple Explanation (ELI5)
Troubleshooting Prometheus usually means answering one question first: did Prometheus fail to discover the target, fail to reach it, fail to store it, or did we just ask the wrong query?
Real-world Analogy
If a parcel never arrives, you check the address, then the delivery route, then the warehouse, then the recipient. Prometheus troubleshooting works the same way: target, network, scrape, storage, then query.
Technical Explanation
The fastest troubleshooting path is layered. Start with target discovery, then endpoint reachability, then scrape status, then raw metric presence, then PromQL and dashboards. In Kubernetes, also inspect ServiceMonitor or PodMonitor selectors, namespace visibility, and network policies.
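The layered order above can be sketched as a small decision helper. This is only an illustration: the function name `next_layer` and its yes/no arguments are hypothetical, and you would supply the answers yourself from the targets page and a raw metric query.

```shell
#!/usr/bin/env bash
# next_layer LISTED HEALTH METRIC  (hypothetical helper, not part of Prometheus)
#   LISTED: yes|no   is the target shown in the targets list?
#   HEALTH: up|down  scrape status reported by Prometheus
#   METRIC: yes|no   does the raw metric exist in Prometheus?
# Prints the first layer that still needs investigation.
next_layer() {
  if [ "$1" != "yes" ]; then
    echo "discovery"      # target never listed: selectors, namespaces, RBAC
  elif [ "$2" != "up" ]; then
    echo "reachability"   # listed but down: port, path, network policy
  elif [ "$3" != "yes" ]; then
    echo "raw-metric"     # scraped, but the app does not expose the metric
  else
    echo "query"          # data exists: inspect PromQL and dashboards
  fi
}
```

For example, `next_layer yes down no` prints `reachability`, because a failing scrape has to be fixed before metric presence is worth checking.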
Visual Representation
Is target listed?
Can Prometheus connect?
Does raw metric exist?
Commands / Syntax
# Prometheus APIs
curl http://localhost:9090/api/v1/targets
curl http://localhost:9090/api/v1/label/__name__/values
curl 'http://localhost:9090/api/v1/query?query=up'

# Check endpoint directly
curl http://my-app:8080/metrics | head

# Kubernetes checks
kubectl get servicemonitor,podmonitor -A
kubectl describe servicemonitor checkout-api -n monitoring
kubectl get svc,pods -n prod -l app=checkout-api -o wide
kubectl logs deploy/monitoring-kube-prometheus-operator -n monitoring
Example (Real-world Use Case)
A new service in production has no dashboard data. The operator logs show no scrape config generated for the ServiceMonitor. Investigation reveals the ServiceMonitor lives in a namespace not watched by the Prometheus instance. Fixing the namespace selector immediately restores metrics.
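One way to confirm a namespace mismatch like this is to print the selectors on the Prometheus custom resource and compare them with the labels on the ServiceMonitor's namespace. The sketch below assumes a Prometheus Operator install with the Prometheus resource in a `monitoring` namespace; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Show, per Prometheus instance, which namespaces and which ServiceMonitors
# it selects. $1: namespace of the Prometheus resource (assumed "monitoring").
show_monitor_selectors() {
  kubectl get prometheus -n "${1:-monitoring}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.serviceMonitorNamespaceSelector}{"\t"}{.spec.serviceMonitorSelector}{"\n"}{end}'
}

# Show the labels on the namespace that holds the ServiceMonitor, so they can
# be compared against serviceMonitorNamespaceSelector. $1: namespace name.
show_namespace_labels() {
  kubectl get namespace "$1" --show-labels
}
```

If the namespace labels do not satisfy the namespace selector, the operator will likely skip every ServiceMonitor in that namespace, which matches the symptom described here.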
Hands-on Section
- Break one target intentionally by changing its scrape port.
- Watch the target switch to DOWN in the Prometheus UI.
- Restore connectivity and verify up returns to 1.
- Break a ServiceMonitor selector and observe how the target disappears entirely.
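While running the exercise, a small polling loop makes the transition between 1 and 0 easier to see than refreshing the UI. This is a rough sketch: it assumes Prometheus listens on http://localhost:9090 and that the target carries a `job` label matching the name you pass in; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Build the PromQL selector for one job's `up` series.
up_query() {
  printf 'up{job="%s"}' "$1"
}

# Poll Prometheus every 5 seconds and print the raw API response.
# $1: job name, $2: Prometheus base URL (assumed http://localhost:9090).
# Stop with Ctrl-C.
watch_up() {
  while true; do
    curl -sG "${2:-http://localhost:9090}/api/v1/query" \
      --data-urlencode "query=$(up_query "$1")"
    echo
    sleep 5
  done
}
```

Run `watch_up checkout-api` in one terminal while you break and restore the target. When the selector is broken rather than the port, the result list should become empty instead of showing 0.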
Try It Yourself
- Write a step-by-step checklist for “metrics not collected.”
- Explain the difference between “target is down” and “target is not discovered.”
- Name one API endpoint that helps inspect raw Prometheus target state.
Debugging Runbook 1: Metrics Not Collected
- Check whether the target exists in Status → Targets.
- If absent, investigate discovery, selectors, annotations, ServiceMonitor, or PodMonitor.
- If present but down, test endpoint reachability and path.
- If present and up, query the raw metric name before touching Grafana.
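The first and last steps of this runbook can also be done from the command line. The sketch below counts targets by health from the targets API response and queries a raw metric by name. It assumes the compact JSON that the Prometheus API normally returns (no spaces after colons); the function names and the default URL are assumptions.

```shell
#!/usr/bin/env bash
# Count targets with a given health value ("up", "down", or "unknown")
# from /api/v1/targets JSON on stdin. Relies on compact JSON output.
count_health() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# Query a raw metric name directly, bypassing Grafana.
# $1: metric name, $2: Prometheus base URL (assumed http://localhost:9090).
raw_metric() {
  curl -sG "${2:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   curl -s 'http://localhost:9090/api/v1/targets?state=active' | count_health down
#   raw_metric http_requests_total
```

A text match like this is only a quick check; for per-target detail such as the last scrape error, inspect the full JSON response.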
Debugging Runbook 2: Kubernetes ServiceMonitor Not Working
- Compare Service labels to ServiceMonitor selector exactly.
- Verify the service port name matches the monitor endpoint port name.
- Check Prometheus namespace selectors and RBAC scope.
- Review operator logs for rejected or ignored monitor resources.
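The label and port-name comparisons in this runbook can be made side by side with two kubectl queries. This is a sketch under assumptions: the resource names and namespaces are whatever you pass in, and the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Print what the ServiceMonitor expects: its label selector, then the
# port names of its endpoints. $1: ServiceMonitor name, $2: its namespace.
monitor_expectations() {
  kubectl get servicemonitor "$1" -n "$2" \
    -o jsonpath='{.spec.selector.matchLabels}{"\n"}{.spec.endpoints[*].port}{"\n"}'
}

# Print what the Services actually provide: labels, then port names.
# $1: namespace, $2: label selector such as app=checkout-api.
service_reality() {
  kubectl get svc -n "$1" -l "$2" \
    -o jsonpath='{range .items[*]}{.metadata.labels}{"\n"}{.spec.ports[*].name}{"\n"}{end}'
}

# Succeed if port name $1 appears in the space-separated list $2.
port_name_in() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For instance, `monitor_expectations checkout-api monitoring` followed by `service_reality prod app=checkout-api` shows both sides. A monitor endpoint port of `metrics` against Service ports `http web` would fail `port_name_in metrics "http web"`, which is the mismatch this runbook looks for.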
Debugging Runbook 3: Empty Dashboard but Healthy Target
- Check whether the dashboard query uses the correct metric and labels.
- Query the raw metric in Prometheus directly.
- Verify the time range includes recent data.
- Check whether recording rules failed or changed metric names.
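For the recording-rule step, the rules API reports a health value per rule, so failed evaluations can be counted without opening the UI. The sketch below assumes compact JSON from the API and treats `"health":"err"` as the marker of a failed evaluation; the function name and URL are assumptions.

```shell
#!/usr/bin/env bash
# Count rules whose last evaluation failed, from /api/v1/rules JSON on stdin.
# Relies on compact JSON output (no spaces after colons).
count_failed_rules() {
  grep -o '"health":"err"' | wc -l | tr -d ' '
}

# Usage:
#   curl -s http://localhost:9090/api/v1/rules | count_failed_rules
```

A result of 0 with a still-empty dashboard points back at the dashboard query, labels, or time range rather than at rule evaluation.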
Debugging Runbook 4: Alert Fires with No Real Incident
- Check whether the alert expression is too sensitive.
- Inspect whether the for clause is missing or too short.
- Correlate with user-impact metrics like latency and error rate.
- Review planned maintenance, rollout windows, or autoscaler behavior.
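To make the `for` clause concrete, the sketch below prints a sample alerting rule that must hold for 10 minutes before firing. Everything in it is hypothetical: the alert name, the `http_requests_total` metric and its `status` label, the 5% threshold, and the 10-minute duration are placeholders to adapt.

```shell
#!/usr/bin/env bash
# Print a sample alerting rule that only fires after the condition has held
# for 10 minutes. Metric, labels, and threshold are hypothetical.
write_sample_rule() {
  cat <<'EOF'
groups:
  - name: checkout-api
    rules:
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="checkout-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
EOF
}

# Usage:
#   write_sample_rule > /tmp/checkout.rules.yml
#   promtool check rules /tmp/checkout.rules.yml
```

Without the `for` line, a single evaluation above the threshold would be enough to fire, which is a common source of alerts during brief rollout spikes.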
Interview Questions
Beginner
The Prometheus target status page, because it tells you whether the target was discovered and whether scraping succeeds.
What does up = 0 mean?
Prometheus attempted to scrape the target and failed.
A missing target was never discovered or included. A down target was discovered but scraping failed.
It confirms whether the app or exporter is actually exposing the metric before you blame Prometheus.
Because the dashboard query, labels, or time range may be wrong even though Prometheus has the data.
Intermediate
Compare selector labels, namespace selectors, port names, and ensure the Prometheus instance watches that namespace.
/api/v1/targets, /api/v1/query, /api/v1/rules, and label or metric metadata endpoints are very useful.
It can block the Prometheus pod from reaching app pods or services even when the app itself is healthy.
The query may use incorrect label filters, wrong aggregation, or the selected time range may not include the series.
Check the Prometheus rules API or rules page and confirm the expected rule group appears without evaluation errors.
Scenario-based
An app exposes metrics at /actuator/prometheus but Prometheus scrapes /metrics. What happens?
The target is likely discovered but down or returns unexpected content because the scrape path is wrong.
I inspect the app instrumentation changes, metric relabeling, and whether the metric name or labels changed in the new version.
Prometheus may fail to discover or reconcile monitor resources, causing missing targets and incomplete scrape configs.
I investigate scheduled jobs, backups, compactions, cron workloads, or daily maintenance that shifts system behavior predictably.
I check discovery, target reachability, raw endpoint output, raw metric presence in Prometheus, then dashboard or alert logic. That order avoids wasted effort.
Summary
Prometheus troubleshooting is most effective when you stay disciplined: discovery first, connectivity second, raw metric third, PromQL fourth, dashboards last. In Kubernetes, selectors, port names, namespaces, and RBAC are recurring failure points.