Exit code cheat sheet:
- 0 — Success (shouldn't crash-loop)
- 1 — Application error
- 126 — Permission denied
- 127 — Command not found
- 137 — Container killed (OOM or external SIGKILL; 128 + 9)
- 139 — Segmentation fault (128 + 11)
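The mapping above is easy to script when you're triaging many pods. A minimal sketch — `explain_exit` is a hypothetical helper name, not a kubectl feature:

```shell
# Decode a container exit code into a human-readable cause.
explain_exit() {
  case "$1" in
    0)   echo "Success (should not crash-loop)";;
    1)   echo "Application error";;
    126) echo "Permission denied";;
    127) echo "Command not found";;
    137) echo "Killed (OOM or external SIGKILL: 128 + 9)";;
    139) echo "Segmentation fault (128 + 11)";;
    *)   echo "Unknown; codes above 128 are 128 + signal number";;
  esac
}

explain_exit 137   # → Killed (OOM or external SIGKILL: 128 + 9)
```

Feed it the Exit Code value from `kubectl describe pod` (Last State → Terminated).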
Debugging Scenarios
Master the systematic approach to diagnosing CrashLoopBackOff, ImagePullBackOff, Pending Pods, Service routing, and networking issues.
🧒 Simple Explanation (ELI5)
When a car won't start, a mechanic checks battery → fuel → spark plugs in a systematic order. Kubernetes debugging works the same way. This lesson teaches you the exact diagnostic flowchart for every common failure state — so you can go from "something's broken" to "fixed" in minutes.
📊 The Debugging Flowchart
Every diagnosis starts with one question: what's the pod's STATUS? Match the status you see (CrashLoopBackOff, ImagePullBackOff, Pending, Running but unreachable) to the scenario below.
🔴 Scenario 1: CrashLoopBackOff
What it means: The container starts, crashes, restarts, crashes again. Kubernetes backs off exponentially between restarts (10s → 20s → 40s → ... → 5min max).
Common Causes
| Cause | How to detect |
|---|---|
| Application error / exception | kubectl logs <pod> — stack trace |
| Missing config file / env var | Logs show "file not found" or "env not set" |
| Wrong command / args | Logs show "exec format error" or immediate exit |
| Insufficient permissions | Logs: "permission denied" |
| Database connection failed | Logs: "connection refused" to DB host |
| OOMKilled | kubectl describe pod → Last State: OOMKilled |
Step-by-Step Debug
# 1. Get the pod name and current status
kubectl get pods -n <namespace>
# NAME          READY   STATUS             RESTARTS   AGE
# myapp-abc12   0/1     CrashLoopBackOff   5          3m
# 2. Check logs (current crash)
kubectl logs myapp-abc12 -n <namespace>
# 3. Check previous crash logs (if container already restarted)
kubectl logs myapp-abc12 -n <namespace> --previous
# 4. Check events and exit code
kubectl describe pod myapp-abc12 -n <namespace>
# Look for:
# Last State: Terminated
# Exit Code: 1 (app error), 137 (OOMKilled/SIGKILL), 139 (segfault)
# 5. If logs are empty, check the command
kubectl get pod myapp-abc12 -o jsonpath='{.spec.containers[0].command}' -n <namespace>
# 6. Debug interactively — override entrypoint
kubectl run debug-pod --image=<same-image> -it --rm --command -- /bin/sh
# Then manually run the app command to see the error
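If describe shows Exit Code 137 with Reason: OOMKilled, the usual fix is a higher memory limit. A minimal Deployment spec fragment — names and values here are illustrative, size the limit above the usage you observe with kubectl top pods:

```yaml
spec:
  template:
    spec:
      containers:
        - name: myapp
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "512Mi"   # exit 137 means the old ceiling was hit
```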
🔴 Scenario 2: ImagePullBackOff
What it means: Kubernetes can't pull the container image from the registry.
Common Causes
| Cause | How to detect |
|---|---|
| Image name typo | Events: "manifest unknown" or "not found" |
| Tag doesn't exist | Events: "manifest for image:tag not found" |
| Private registry, no credentials | Events: "unauthorized" or "access denied" |
| Registry unreachable | Events: "dial tcp: connection refused" |
Step-by-Step Debug
# 1. Describe pod → read Events section
kubectl describe pod <pod-name> -n <namespace>
# Events:
# Failed to pull image "myrepo/myimage:v9.99": ... not found
# 2. Verify the image exists (locally)
docker pull myrepo/myimage:v9.99
# If this fails, the image doesn't exist or you need auth
# 3. For private registries — create/check imagePullSecrets
kubectl get secrets -n <namespace> | grep docker
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}' -n <namespace>
# 4. Create image pull secret if missing
kubectl create secret docker-registry regcred \
--docker-server=<registry-url> \
--docker-username=<user> \
--docker-password=<pass> \
-n <namespace>
# 5. Add to deployment spec:
# spec.template.spec.imagePullSecrets:
# - name: regcred
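Instead of editing the manifest by hand, the secret can be attached with a patch. A sketch — the deployment name is a placeholder:

```shell
kubectl patch deployment <deployment-name> -n <namespace> \
  -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
```

This triggers a rollout, so the pods are recreated and the pull is retried immediately.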
🔴 Scenario 3: Pending Pods
What it means: The pod can't be scheduled to any node.
Common Causes
| Cause | How to detect |
|---|---|
| Insufficient CPU/memory | Events: "Insufficient cpu" or "Insufficient memory" |
| No matching nodeSelector/affinity | Events: "0/3 nodes are available: ... didn't match selector" |
| Taints not tolerated | Events: "... had taint {key=value:NoSchedule}" |
| PVC not bound | Events: "persistentvolumeclaim ... not bound" |
| ResourceQuota exceeded | Events: "exceeded quota" |
Step-by-Step Debug
# 1. Describe the pending pod
kubectl describe pod <pod-name> -n <namespace>
# Events will tell you WHY scheduling failed

# 2. Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
# See how much capacity is left on each node

# 3. Check node taints
kubectl describe nodes | grep -A3 Taints
# If nodes have taints, pods need matching tolerations

# 4. Check node labels (for nodeSelector)
kubectl get nodes --show-labels

# 5. Check PVC status (if using persistent volumes)
kubectl get pvc -n <namespace>
# STATUS should be "Bound", not "Pending"

# 6. Check ResourceQuota
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>

# Fixes:
# - Reduce resource requests
# - Add tolerations for node taints
# - Fix nodeSelector labels
# - Provision a PersistentVolume
# - Increase the ResourceQuota or free up resources
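When the events show an untolerated taint, the pod spec needs a matching toleration. A sketch for a hypothetical taint key=value:NoSchedule — copy the real key, value, and effect from the event message:

```yaml
spec:
  tolerations:
    - key: "key"
      operator: "Equal"
      value: "value"
      effect: "NoSchedule"
```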
🔴 Scenario 4: Service Not Reaching Pods
What it means: The application is running but unreachable through the Service.
Step-by-Step Debug
# 1. Verify pods are actually Running and Ready
kubectl get pods -n <namespace> -l <your-selector>
# All should be 1/1 Running
# 2. Check Service exists and has endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
# If ENDPOINTS is <none>, the selector doesn't match any pods
# 3. Compare labels
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
# Must match exactly
# 4. Check the target port
kubectl get svc <service-name> -n <namespace> -o yaml
# spec.ports[].targetPort must match the container's actual port
# 5. Test connectivity from within the cluster
kubectl run test-curl --image=curlimages/curl -it --rm -- \
curl -s http://<service-name>.<namespace>.svc.cluster.local:<port>
# 6. Test direct pod connectivity
kubectl get endpoints <service-name> -n <namespace>
kubectl run test-curl --image=curlimages/curl -it --rm -- \
curl -s http://<pod-ip>:<container-port>
# If direct works but service doesn't, it's a service/selector issue
If DNS doesn't resolve: kubectl run dns-test --image=busybox -it --rm -- nslookup <service-name>.<namespace>.svc.cluster.local. If this fails, check CoreDNS pods: kubectl get pods -n kube-system -l k8s-app=kube-dns.
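If the root cause turns out to be a selector mismatch, the Service side is usually the safer place to fix it. A hypothetical patch, assuming the pods actually carry the label app=myapp (note: a strategic merge patch merges selector keys, so remove any stale keys that remain):

```shell
kubectl patch svc <service-name> -n <namespace> \
  -p '{"spec":{"selector":{"app":"myapp"}}}'
```

Re-run kubectl get endpoints afterwards — it should now list pod IPs.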
🔴 Scenario 5: Ingress Returns 404 / 502 / 503
Debug by Error Code
| Code | Meaning | Check |
|---|---|---|
| 404 | Ingress controller can't find a matching rule | Ingress host, path, ingressClassName |
| 502 | Ingress reached the service but backend failed | Backend pods crashing or not ready |
| 503 | No healthy backends available | Service has 0 endpoints |
# 1. Verify the ingress resource
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>
# Check: host, path, backend service name, port

# 2. Check the ingress controller is running
kubectl get pods -n ingress-nginx
# All should be Running

# 3. Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# 4. Verify the backend service has endpoints
kubectl get endpoints <backend-svc> -n <namespace>

# 5. For TLS issues
kubectl get secret <tls-secret> -n <namespace>
# Verify the secret exists and has tls.crt + tls.key
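For 404s it helps to diff against a known-good minimal resource. A sketch — host, names, and class are placeholders for your environment:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  namespace: <namespace>
spec:
  ingressClassName: nginx         # must match an installed controller
  rules:
    - host: myapp.example.com     # must match the Host header you send
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: <backend-svc>   # must exist and have endpoints
                port:
                  number: 80
```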
🔴 Scenario 6: Node NotReady
# 1. Check node status
kubectl get nodes
# STATUS: NotReady

# 2. Describe the node
kubectl describe node <node-name>
# Look at the Conditions section:
# MemoryPressure, DiskPressure, PIDPressure, Ready

# 3. Check kubelet on the node (SSH required)
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -50

# 4. Common causes:
# - kubelet stopped → systemctl restart kubelet
# - Certificate expired → kubeadm certs renew all
# - Disk full → free disk space
# - Network issue → check CNI plugin pods: kubectl get pods -n kube-system
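Before rebooting or repairing a NotReady node, it's safest to move workloads off it first. A sketch of the standard sequence:

```shell
kubectl cordon <node-name>        # mark unschedulable: no new pods land here
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ...fix the node (kubelet, disk, certs), then let pods return:
kubectl uncordon <node-name>
```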
🛠️ Essential Debugging Commands Reference
# === POD DEBUGGING ===
kubectl get pods -A # All pods, all namespaces
kubectl get pods -o wide # Show node & IP
kubectl describe pod <name> # Full details + events
kubectl logs <pod> # Current container logs
kubectl logs <pod> --previous # Previous crash logs
kubectl logs <pod> -c <container> # Specific container (multi-container)
kubectl logs <pod> -f # Stream live logs
kubectl exec -it <pod> -- /bin/sh # Shell into container
kubectl cp <pod>:/path/file ./local-file # Copy file from pod
# === SERVICE / NETWORK ===
kubectl get svc,endpoints -n <ns> # Service + endpoints together
kubectl port-forward svc/<name> 8080:80 # Quick local test
# === CLUSTER / NODES ===
kubectl get nodes -o wide # Node status + versions
kubectl top nodes # Resource usage (needs metrics-server)
kubectl top pods -n <ns> # Pod resource usage
kubectl get events -n <ns> --sort-by=.lastTimestamp # Recent events
# === ADVANCED ===
kubectl get pods --field-selector=status.phase=Failed # Filter by status
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
kubectl api-resources # All available resource types
kubectl explain pod.spec.containers # Built-in docs
📊 Decision Tree: Quick Reference
| Symptom | First Command | Next Steps |
|---|---|---|
| Pod won't start | kubectl describe pod | Read Events section → follow the error message |
| Pod starts then crashes | kubectl logs --previous | Read the crash log → fix application error |
| Pod running but app broken | kubectl exec -it -- sh | Check config, test connectivity, verify env vars |
| Can't reach the app | kubectl get endpoints | No endpoints → fix labels. Has endpoints → check ports/DNS |
| Everything looks fine but slow | kubectl top pods | CPU/memory throttling → adjust resources |
| Cluster unstable | kubectl get nodes | NotReady nodes → check kubelet, disk, network |
📝 Summary
- CrashLoopBackOff: Check logs (current + previous) and exit codes first
- ImagePullBackOff: Verify image name, tag, and registry credentials
- Pending: Describe pod for scheduling failures — resources, taints, selectors, PVCs
- Service issues: Always check endpoints — if empty, fix selector labels
- Ingress errors: Match the HTTP code to the cause (404=routing, 502=backend crash, 503=no endpoints)
- Golden rule: kubectl describe and kubectl logs solve 90% of all issues