Hands-on Lesson 13 of 14

Debugging Scenarios

Master the systematic approach to diagnosing CrashLoopBackOff, ImagePullBackOff, Pending Pods, Service routing, and networking issues.

🧒 Simple Explanation (ELI5)

When a car won't start, a mechanic checks battery → fuel → spark plugs in a systematic order. Kubernetes debugging works the same way. This lesson teaches you the exact diagnostic flowchart for every common failure state — so you can go from "something's broken" to "fixed" in minutes.

📊 The Debugging Flowchart

Universal K8s debug flow — always start with `kubectl get pods`, then branch on the STATUS column:

- **Pending** → `kubectl describe pod` → check resources, taints, scheduling constraints
- **ImagePull\*** → `kubectl describe pod` → check image name, tag, registry credentials
- **CrashLoopBackOff** → `kubectl logs` → check the application and its command/args
- **Running but broken** → `kubectl logs`, `kubectl exec` → check probes and configuration

🔴 Scenario 1: CrashLoopBackOff

What it means: The container starts, crashes, restarts, crashes again. Kubernetes backs off exponentially between restarts (10s → 20s → 40s → ... → 5min max).

Common Causes

| Cause | How to detect |
| --- | --- |
| Application error / exception | `kubectl logs <pod>` — stack trace |
| Missing config file / env var | Logs show "file not found" or "env not set" |
| Wrong command / args | Logs show "exec format error" or immediate exit |
| Insufficient permissions | Logs: "permission denied" |
| Database connection failed | Logs: "connection refused" to DB host |
| OOMKilled | `kubectl describe pod` → Last State: OOMKilled |

Step-by-Step Debug

```bash
# 1. Get the pod name and current status
kubectl get pods -n <namespace>
# NAME          READY   STATUS             RESTARTS   AGE
# myapp-abc12   0/1     CrashLoopBackOff   5          3m

# 2. Check logs (current crash)
kubectl logs myapp-abc12 -n <namespace>

# 3. Check previous crash logs (if container already restarted)
kubectl logs myapp-abc12 -n <namespace> --previous

# 4. Check events and exit code
kubectl describe pod myapp-abc12 -n <namespace>
# Look for:
#   Last State: Terminated
#   Exit Code: 1 (app error), 137 (OOMKilled/SIGKILL), 139 (segfault)

# 5. If logs are empty, check the command
kubectl get pod myapp-abc12 -o jsonpath='{.spec.containers[0].command}' -n <namespace>

# 6. Debug interactively — override entrypoint
kubectl run debug-pod --image=<same-image> -it --rm -- /bin/sh
# Then manually run the app command to see the error
```
💡 Exit Code Reference

Exit 0: Success (shouldn't crash-loop). Exit 1: Application error. Exit 126: Permission denied. Exit 127: Command not found. Exit 137: Container killed (OOM or external SIGKILL). Exit 139: Segmentation fault.
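These codes follow the standard shell convention (128 + signal number for signal deaths), and you can reproduce the most common ones locally without a cluster. A quick sketch — the commands below deliberately simulate failures:

```shell
# Exit 1 — generic application error
sh -c 'exit 1'; echo "app error -> $?"

# Exit 127 — command not found (a deliberately nonexistent binary)
sh -c 'definitely_not_a_real_command' 2>/dev/null; echo "not found -> $?"

# Exit 137 — killed by SIGKILL (128 + 9), the same code the kubelet reports for OOMKilled
sh -c 'kill -9 $$'; echo "sigkill -> $?"
```

Seeing 137 in `describe pod` therefore only tells you the container was SIGKILLed; check `Last State` / node memory pressure to confirm whether it was the OOM killer.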

🔴 Scenario 2: ImagePullBackOff

What it means: Kubernetes can't pull the container image from the registry.

Common Causes

| Cause | How to detect |
| --- | --- |
| Image name typo | Events: "manifest unknown" or "not found" |
| Tag doesn't exist | Events: "manifest for image:tag not found" |
| Private registry, no credentials | Events: "unauthorized" or "access denied" |
| Registry unreachable | Events: "dial tcp: connection refused" |

Step-by-Step Debug

```bash
# 1. Describe pod → read Events section
kubectl describe pod <pod-name> -n <namespace>
# Events:
#   Failed to pull image "myrepo/myimage:v9.99": ... not found

# 2. Verify the image exists (locally)
docker pull myrepo/myimage:v9.99
# If this fails, the image doesn't exist or you need auth

# 3. For private registries — create/check imagePullSecrets
kubectl get secrets -n <namespace> | grep docker
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}' -n <namespace>

# 4. Create image pull secret if missing
kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<user> \
  --docker-password=<pass> \
  -n <namespace>

# 5. Add to deployment spec:
# spec.template.spec.imagePullSecrets:
#   - name: regcred
```
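The deployment-side change in step 5 looks like this in full — a minimal sketch, assuming the secret is named `regcred` as created above (all other names are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                 # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      imagePullSecrets:
      - name: regcred         # must match the secret created in step 4
      containers:
      - name: myapp
        image: myrepo/myimage:v9.99
```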

🔴 Scenario 3: Pending Pods

What it means: The pod can't be scheduled to any node.

Common Causes

| Cause | How to detect |
| --- | --- |
| Insufficient CPU/memory | Events: "Insufficient cpu" or "Insufficient memory" |
| No matching nodeSelector/affinity | Events: "0/3 nodes are available: ... didn't match selector" |
| Taints not tolerated | Events: "... had taint {key=value:NoSchedule}" |
| PVC not bound | Events: "persistentvolumeclaim ... not bound" |
| ResourceQuota exceeded | Events: "exceeded quota" |

Step-by-Step Debug

```bash
# 1. Describe the pending pod
kubectl describe pod <pod-name> -n <namespace>
# Events will tell you WHY scheduling failed

# 2. Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"
# See how much capacity is left on each node

# 3. Check node taints
kubectl describe nodes | grep -A3 Taints
# If nodes have taints, pods need matching tolerations

# 4. Check node labels (for nodeSelector)
kubectl get nodes --show-labels

# 5. Check PVC status (if using persistent volumes)
kubectl get pvc -n <namespace>
# STATUS should be "Bound", not "Pending"

# 6. Check ResourceQuota
kubectl get resourcequota -n <namespace>
kubectl describe resourcequota -n <namespace>

# Fixes:
# - Reduce resource requests
# - Add tolerations for node taints
# - Fix nodeSelector labels
# - Provision PersistentVolume
# - Increase ResourceQuota or free up resources
```
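As an example of the first two fixes: if `describe pod` shows an untolerated taint plus an "Insufficient cpu" event, both changes land in the pod template. A hedged sketch — the taint key `dedicated=gpu:NoSchedule` and the resource numbers are made up for illustration:

```yaml
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"      # hypothetical taint key on the target nodes
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - name: myapp
        image: myrepo/myimage:v1
        resources:
          requests:
            cpu: "250m"       # reduced so the scheduler can fit the pod
            memory: "256Mi"
```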

🔴 Scenario 4: Service Not Reaching Pods

What it means: The application is running but unreachable through the Service.

Step-by-Step Debug

```bash
# 1. Verify pods are actually Running and Ready
kubectl get pods -n <namespace> -l <your-selector>
# All should be 1/1 Running

# 2. Check Service exists and has endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
# If ENDPOINTS is <none>, the selector doesn't match any pods

# 3. Compare labels
kubectl get pods -n <namespace> --show-labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'
# Must match exactly

# 4. Check the target port
kubectl get svc <service-name> -n <namespace> -o yaml
# spec.ports[].targetPort must match the container's actual port

# 5. Test connectivity from within the cluster
kubectl run test-curl --image=curlimages/curl -it --rm -- \
  curl -s http://<service-name>.<namespace>.svc.cluster.local:<port>

# 6. Test direct pod connectivity
kubectl get endpoints <service-name> -n <namespace>
kubectl run test-curl --image=curlimages/curl -it --rm -- \
  curl -s http://<pod-ip>:<container-port>
# If direct works but service doesn't, it's a service/selector issue
```
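The "ENDPOINTS is \<none\>" case from step 2 is almost always a label mismatch. A minimal sketch of a correctly matching pair — all names here are placeholders:

```yaml
# Service — spec.selector must equal the pod's labels exactly
apiVersion: v1
kind: Service
metadata:
  name: myapp-svc
spec:
  selector:
    app: myapp            # must match the pod labels below
  ports:
  - port: 80              # port clients connect to
    targetPort: 8080      # port the container actually listens on (step 4)
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  labels:
    app: myapp            # matches the Service selector above
spec:
  containers:
  - name: myapp
    image: myrepo/myimage:v1
    ports:
    - containerPort: 8080
```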
🔑 DNS debugging

If DNS doesn't resolve: `kubectl run dns-test --image=busybox -it --rm -- nslookup <service-name>.<namespace>.svc.cluster.local`. If this fails, check the CoreDNS pods: `kubectl get pods -n kube-system -l k8s-app=kube-dns`.

🔴 Scenario 5: Ingress Returns 404 / 502 / 503

Debug by Error Code

| Code | Meaning | Check |
| --- | --- | --- |
| 404 | Ingress controller can't find a matching rule | Ingress host, path, `ingressClassName` |
| 502 | Ingress reached the service but the backend failed | Backend pods crashing or not ready |
| 503 | No healthy backends available | Service has 0 endpoints |
```bash
# 1. Verify ingress resource
kubectl get ingress -n <namespace>
kubectl describe ingress <name> -n <namespace>
# Check: host, path, backend service name, port

# 2. Check ingress controller is running
kubectl get pods -n ingress-nginx
# All should be Running

# 3. Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# 4. Verify backend service has endpoints
kubectl get endpoints <backend-svc> -n <namespace>

# 5. For TLS issues
kubectl get secret <tls-secret> -n <namespace>
# Verify the secret exists and has tls.crt + tls.key
```
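As a reference point for step 1, here is a minimal ingress routing one host to one backend — a sketch with hypothetical names (`myapp.example.com`, `myapp-svc`):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: nginx       # must match an installed controller class
  rules:
  - host: myapp.example.com     # a 404 often means the request Host doesn't match this
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-svc     # must exist and have endpoints (502/503 otherwise)
            port:
              number: 80
```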

🔴 Scenario 6: Node NotReady

```bash
# 1. Check node status
kubectl get nodes
# STATUS: NotReady

# 2. Describe the node
kubectl describe node <node-name>
# Look at Conditions section:
#   MemoryPressure, DiskPressure, PIDPressure, Ready

# 3. Check kubelet on the node (SSH required)
systemctl status kubelet
journalctl -u kubelet --no-pager -n 50    # last 50 log lines (don't combine -f with tail)

# 4. Common causes:
# - kubelet stopped → systemctl restart kubelet
# - Certificate expired → kubeadm certs renew all
# - Disk full → free disk space
# - Network issue → check CNI plugin pods: kubectl get pods -n kube-system
```

🛠️ Essential Debugging Commands Reference

```bash
# === POD DEBUGGING ===
kubectl get pods -A                              # All pods, all namespaces
kubectl get pods -o wide                         # Show node & IP
kubectl describe pod <name>                      # Full details + events
kubectl logs <pod>                               # Current container logs
kubectl logs <pod> --previous                    # Previous crash logs
kubectl logs <pod> -c <container>                # Specific container (multi-container)
kubectl logs <pod> -f                            # Stream live logs
kubectl exec -it <pod> -- /bin/sh                # Shell into container
kubectl cp <pod>:/path/file ./local-file         # Copy file from pod

# === SERVICE / NETWORK ===
kubectl get svc,endpoints -n <ns>                # Service + endpoints together
kubectl port-forward svc/<name> 8080:80          # Quick local test

# === CLUSTER / NODES ===
kubectl get nodes -o wide                        # Node status + versions
kubectl top nodes                                # Resource usage (needs metrics-server)
kubectl top pods -n <ns>                         # Pod resource usage
kubectl get events -n <ns> --sort-by=.lastTimestamp  # Recent events

# === ADVANCED ===
kubectl get pods --field-selector=status.phase=Failed  # Filter by status
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
kubectl api-resources                            # All available resource types
kubectl explain pod.spec.containers              # Built-in docs
```

📊 Decision Tree: Quick Reference

| Symptom | First command | Next steps |
| --- | --- | --- |
| Pod won't start | `kubectl describe pod` | Read the Events section → follow the error message |
| Pod starts then crashes | `kubectl logs --previous` | Read the crash log → fix the application error |
| Pod running but app broken | `kubectl exec -it -- sh` | Check config, test connectivity, verify env vars |
| Can't reach the app | `kubectl get endpoints` | No endpoints → fix labels. Has endpoints → check ports/DNS |
| Everything looks fine but slow | `kubectl top pods` | CPU/memory throttling → adjust resources |
| Cluster unstable | `kubectl get nodes` | NotReady nodes → check kubelet, disk, network |
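The quick triage above can be partially automated: the awk filter below lists every pod that is neither Running nor Completed. It is shown against canned sample output (hypothetical pod names) so the column logic is visible; in a real cluster you would pipe `kubectl get pods -A --no-headers` into the same awk program:

```shell
# With -A, the columns are: NAMESPACE NAME READY STATUS RESTARTS AGE — STATUS is $4
cat <<'EOF' | awk '$4 != "Running" && $4 != "Completed" {print $1 "/" $2 " -> " $4}'
default    web-abc12    1/1   Running            0   5m
default    migrate-job  0/1   Completed          0   1h
payments   api-def34    0/1   CrashLoopBackOff   7   12m
infra      agent-gh56   0/1   ImagePullBackOff   0   3m
EOF
# Prints:
#   payments/api-def34 -> CrashLoopBackOff
#   infra/agent-gh56 -> ImagePullBackOff
```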

📝 Summary

Every debugging session follows the same loop: observe the STATUS with `kubectl get pods`, read the Events with `kubectl describe`, read the logs (including `--previous` for crashed containers), and only then change something. CrashLoopBackOff means the application itself is failing; ImagePullBackOff means the image can't be fetched; Pending means the scheduler can't place the pod; and an unreachable Service almost always comes down to selectors, ports, or missing endpoints.
