Hands-on | Lesson 10 of 11

Troubleshooting

Debug the most common Prometheus failures: metrics not collected, targets down, empty graphs, noisy alerts, and broken Kubernetes scraping.

Simple Explanation (ELI5)

Troubleshooting Prometheus usually means answering one question first: did Prometheus fail to discover the target, fail to reach it, fail to store it, or did we just ask the wrong query?

Real-world Analogy

If a parcel never arrives, you check the address, then the delivery route, then the warehouse, then the recipient. Prometheus troubleshooting works the same way: target, network, scrape, storage, then query.

Technical Explanation

The fastest troubleshooting path is layered. Start with target discovery, then endpoint reachability, then scrape status, then raw metric presence, then PromQL and dashboards. In Kubernetes, also inspect ServiceMonitor or PodMonitor selectors, namespace visibility, and network policies.
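The layered order can be sketched as a small shell function that stops at the first failing layer. This is a minimal illustration, not a complete tool: the `PROM_URL` default, the function name `triage_job`, and the use of plain text matching instead of a JSON parser are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch: walk the layers in order and report the first one that fails.
# PROM_URL and the job name passed in are assumptions; adjust for your setup.
PROM_URL="${PROM_URL:-http://localhost:9090}"

triage_job() {
  local job="$1" targets result

  # Layer 1: discovery. Is any active target listed for this job?
  targets="$(curl -s "${PROM_URL}/api/v1/targets?state=active")"
  if ! printf '%s' "$targets" | grep -q "\"job\":\"${job}\""; then
    echo "discovery: no active target for job ${job}"
    return 1
  fi

  # Layers 2-3: reachability and scrape status, read from the up series.
  result="$(curl -s -G "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=up{job=\"${job}\"} == 0")"
  if printf '%s' "$result" | grep -q '"value":'; then
    echo "scrape: at least one target for job ${job} has up == 0"
    return 1
  fi

  echo "ok: job ${job} is discovered and scraped; check metric names and PromQL next"
}
```

Calling `triage_job checkout-api` prints a single line naming the layer to look at. A JSON-aware tool would be more robust than `grep` for real use.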

Visual Representation

1. Discover: Is the target listed?
2. Reach: Can Prometheus connect?
3. Query: Does the raw metric exist?

Commands / Syntax

bash
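# Prometheus APIs (-s hides the progress meter; quote URLs that contain ?
# so the shell does not treat it as a glob character)
curl -s http://localhost:9090/api/v1/targets
curl -s http://localhost:9090/api/v1/label/__name__/values
curl -s 'http://localhost:9090/api/v1/query?query=up'

# Check endpoint directly
curl -s http://my-app:8080/metrics | head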

# Kubernetes checks
kubectl get servicemonitor,podmonitor -A
kubectl describe servicemonitor checkout-api -n monitoring
kubectl get svc,pods -n prod -l app=checkout-api -o wide
kubectl logs deploy/monitoring-kube-prometheus-operator -n monitoring

Example (Real-world Use Case)

A new service in production has no dashboard data. The operator logs show no scrape config generated for the ServiceMonitor. Investigation reveals the ServiceMonitor lives in a namespace not watched by the Prometheus instance. Fixing the namespace selector restores metrics as soon as the operator regenerates the scrape config and the next scrape runs.
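One way to check which namespaces a Prometheus instance watches is to read `spec.serviceMonitorNamespaceSelector` from the Prometheus custom resource. The sketch below assumes a Prometheus Operator setup; the function names are hypothetical, and the resource name and namespace you pass in depend on your installation.

```shell
#!/usr/bin/env bash
# Sketch: interpret the ServiceMonitor namespace selector of a Prometheus resource.

# Translate the raw selector text into a plain-language meaning.
explain_ns_selector() {
  local selector="$1"
  case "$selector" in
    "")   echo "unset: only the Prometheus object's own namespace is watched" ;;
    "{}") echo "empty: all namespaces are watched" ;;
    *)    echo "restricted: only namespaces whose labels match ${selector}" ;;
  esac
}

# Read the selector from the cluster and explain it.
# $1 = Prometheus resource name, $2 = its namespace (both are assumptions).
watched_namespaces() {
  local name="$1" ns="$2" selector
  selector="$(kubectl get prometheus "$name" -n "$ns" \
    -o jsonpath='{.spec.serviceMonitorNamespaceSelector}')"
  explain_ns_selector "$selector"
}
```

For example, `watched_namespaces k8s monitoring` would explain the selector of a resource named `k8s`. If the result is "restricted", compare the selector with the output of `kubectl get namespace prod --show-labels`.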

Hands-on Section

  1. Break one target intentionally by changing its scrape port.
  2. Watch the target switch to DOWN in the Prometheus UI.
  3. Restore connectivity and verify up returns to 1.
  4. Break a ServiceMonitor selector and observe how the target disappears entirely.
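While working through these steps, it can help to watch the target's state from a terminal. The following sketch reads the `up` series for one instance and prints UP, DOWN, or MISSING. The `PROM_URL` default and the function names are assumptions for illustration, and the `sed` extraction expects the query to return at most one series.

```shell
#!/usr/bin/env bash
# Sketch: report whether a target is up, down, or not present at all.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print the current value of up for one instance, or nothing if no series exists.
up_value() {
  local instance="$1"
  curl -s -G "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=up{instance=\"${instance}\"}" |
    sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"\].*/\1/p'
}

# Map the raw value onto the three states seen in the exercise.
describe_up() {
  case "$(up_value "$1")" in
    1) echo "UP" ;;
    0) echo "DOWN" ;;
    *) echo "MISSING" ;;
  esac
}
```

Running `describe_up my-app:8080` after each step should show DOWN after the port change, UP after the fix, and MISSING once the broken selector removes the target and its last sample ages out.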

Try It Yourself

Debugging Runbook 1: Metrics Not Collected

Debugging Runbook 2: Kubernetes ServiceMonitor Not Working

Debugging Runbook 3: Empty Dashboard but Healthy Target

Debugging Runbook 4: Alert Fires with No Real Incident

Interview Questions

Beginner

What is the first place to check when metrics are missing?

The Prometheus target status page, because it tells you whether the target was discovered and whether scraping succeeds.

What does up = 0 mean?

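Prometheus attempted to scrape the target and the scrape failed, for example because the connection was refused, the request timed out, or the endpoint returned an error status.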

What is the difference between a missing target and a down target?

A missing target was never discovered or included. A down target was discovered but scraping failed.

Why check the raw metrics endpoint directly?

It confirms whether the app or exporter is actually exposing the metric before you blame Prometheus.
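A quick way to run that check is to fetch the endpoint and look for sample lines that start with the metric name. This is a sketch under simple assumptions: the function name `exposes_metric` is hypothetical, and the endpoint is assumed to serve the Prometheus text format without authentication.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the endpoint exposes at least one sample of the given metric.
exposes_metric() {
  local url="$1" metric="$2"
  # In the text format, a sample line starts with the metric name
  # followed by either '{' (labels) or a space (no labels).
  curl -s "$url" | grep -Eq "^${metric}(\{| )"
}
```

For example, `exposes_metric http://my-app:8080/metrics http_requests_total && echo found`. Note that histogram and summary samples carry suffixes such as `_bucket`, `_sum`, and `_count`, so pass the full sample name.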

Why can Grafana show empty graphs even when Prometheus is healthy?

Because the dashboard query, labels, or time range may be wrong even though Prometheus has the data.

Intermediate

How do you debug a ServiceMonitor mismatch?

Compare selector labels, namespace selectors, port names, and ensure the Prometheus instance watches that namespace.
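As one piece of that comparison, the port-name check can be scripted. The sketch below reads the port names a ServiceMonitor expects and the port names a Service actually defines, then reports any mismatch. The function name and the resource names used in the example call are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: compare ServiceMonitor endpoint port names with Service port names.
# $1 = ServiceMonitor name, $2 = its namespace, $3 = Service name, $4 = its namespace.
check_port_names() {
  local sm="$1" sm_ns="$2" svc="$3" svc_ns="$4" wanted have port
  # Port names referenced by the ServiceMonitor endpoints (space separated).
  wanted="$(kubectl get servicemonitor "$sm" -n "$sm_ns" \
    -o jsonpath='{.spec.endpoints[*].port}')"
  # Port names defined on the Service (space separated).
  have="$(kubectl get svc "$svc" -n "$svc_ns" \
    -o jsonpath='{.spec.ports[*].name}')"
  for port in $wanted; do
    case " $have " in
      *" $port "*) echo "ok: port ${port} exists on service ${svc}" ;;
      *) echo "mismatch: ServiceMonitor expects port ${port}, service exposes: ${have}" ;;
    esac
  done
}
```

An example call is `check_port_names checkout-api monitoring checkout-api prod`. The label side of the comparison can be inspected the same way with `-o jsonpath='{.spec.selector.matchLabels}'` on the ServiceMonitor and `--show-labels` on the Service.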

What Prometheus APIs help in debugging?

/api/v1/targets, /api/v1/query, /api/v1/rules, and label or metric metadata endpoints are very useful.

How can network policy break monitoring in Kubernetes?

It can block the Prometheus pod from reaching app pods or services even when the app itself is healthy.
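To test this, request the metrics endpoint from inside the Prometheus pod itself rather than from your workstation. The sketch below assumes an operator-managed pod with a container named `prometheus` whose image includes BusyBox `wget`; the function name and the pod name in the example call are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: probe an application endpoint from inside the Prometheus pod.
# $1 = Prometheus pod name, $2 = its namespace, $3 = URL to probe.
probe_from_prometheus() {
  local pod="$1" ns="$2" url="$3"
  # -T 5 sets a five second timeout so a silently dropped connection fails fast.
  if kubectl exec -n "$ns" "$pod" -c prometheus -- \
      wget -q -T 5 -O - "$url" >/dev/null 2>&1; then
    echo "reachable"
  else
    echo "blocked or unreachable"
  fi
}
```

An example call is `probe_from_prometheus prometheus-k8s-0 monitoring http://checkout-api.prod.svc:8080/metrics`. If the same URL works from a pod in the application namespace but not from the Prometheus pod, `kubectl get networkpolicy -n prod` is a reasonable next step.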

Why might a metric exist but your query still return nothing?

The query may use incorrect label filters, wrong aggregation, or the selected time range may not include the series.
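A simple way to separate these causes is to count series for the bare metric name and again for the filtered selector. The sketch below is illustrative only: the `PROM_URL` default, the function names, and the example metric and label are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: tell "metric is absent" apart from "label filter matches nothing".
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print how many series an instant-vector selector currently returns (0 if none).
series_count() {
  local expr="$1" n
  n="$(curl -s -G "${PROM_URL}/api/v1/query" \
    --data-urlencode "query=count(${expr})" |
    sed -n 's/.*"value":\[[^,]*,"\([^"]*\)"\].*/\1/p')"
  echo "${n:-0}"
}

# $1 = bare metric name, $2 = the same metric with the label filter applied.
diagnose_empty_query() {
  local metric="$1" filtered="$2" total matched
  total="$(series_count "$metric")"
  matched="$(series_count "$filtered")"
  if [ "$total" -eq 0 ]; then
    echo "metric ${metric} has no current series: check scraping"
  elif [ "$matched" -eq 0 ]; then
    echo "${total} series exist but none match the filter: check label names and values"
  else
    echo "${matched} of ${total} series match"
  fi
}
```

An example call is `diagnose_empty_query http_requests_total 'http_requests_total{status="500"}'`. Instant queries only see recent samples, so a series that stopped reporting some time ago counts as absent here even if older data exists.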

How do you confirm whether a rule file is loaded?

Check the Prometheus rules API or rules page and confirm the expected rule group appears without evaluation errors.
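The same check can be done from the command line against `/api/v1/rules`. This is a rough sketch: it relies on plain text matching and assumes the response lists each group's `name` field directly before its `file` field, which may not hold for every Prometheus version. The `PROM_URL` default and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that a rule group is loaded and count rules reporting errors.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Succeed if a rule group with this name appears in the rules API response.
# Assumption: groups serialize as "name":"...","file":"..." in that order.
rule_group_loaded() {
  local group="$1"
  curl -s "${PROM_URL}/api/v1/rules" |
    grep -q "\"name\":\"${group}\",\"file\":"
}

# Print the number of rules whose health field is "err".
failing_rule_count() {
  curl -s "${PROM_URL}/api/v1/rules" |
    grep -o '"health":"err"' | wc -l | tr -d ' '
}
```

For example, `rule_group_loaded checkout-alerts && echo loaded`, followed by `failing_rule_count`. For anything beyond a quick check, parsing the JSON properly is the safer choice.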

Scenario-based

A service exports metrics on /actuator/prometheus but Prometheus scrapes /metrics. What happens?

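The target is discovered, but the scrape usually fails (for example with a 404), so the target shows as DOWN. If /metrics happens to serve some other content, the scrape fails with a parse error instead. Either way, the fix is to set the correct scrape path.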
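To confirm a path mismatch, probe the candidate paths and compare HTTP status codes. In this sketch the function name `probe_paths` and the base URL in the example call are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: print the HTTP status code returned by each candidate metrics path.
# $1 = base URL, remaining arguments = paths to try.
probe_paths() {
  local base="$1" path code
  shift
  for path in "$@"; do
    # -o /dev/null discards the body; -w prints only the status code.
    code="$(curl -s -o /dev/null -w '%{http_code}' "${base}${path}")"
    echo "${path} ${code}"
  done
}
```

An example call is `probe_paths http://my-app:8080 /metrics /actuator/prometheus`. The path that returns 200 is the one the scrape config should use.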

Prometheus target is up, but one critical metric vanished after a deploy. What do you inspect?

I inspect the app instrumentation changes, metric relabeling, and whether the metric name or labels changed in the new version.

Operator logs show RBAC errors. What monitoring symptom might you see?

Prometheus may fail to discover or reconcile monitor resources, causing missing targets and incomplete scrape configs.

An alert fires every night at midnight but there is no incident. What do you investigate?

I investigate scheduled jobs, backups, compactions, cron workloads, or daily maintenance that shifts system behavior predictably.

What order do you follow when a developer says “Prometheus is broken”?

I check discovery, target reachability, raw endpoint output, raw metric presence in Prometheus, then dashboard or alert logic. That order avoids wasted effort.

Summary

Prometheus troubleshooting is most effective when you stay disciplined: discovery first, connectivity second, raw metric third, PromQL fourth, dashboards last. In Kubernetes, selectors, port names, namespaces, and RBAC are recurring failure points.