Debugging Scenarios — kubectl & az CLI Diagnostics
A systematic debugging toolkit for AKS. Learn the exact commands, order of operations, and mental models that SREs use to diagnose production issues on Azure Kubernetes Service.
🧠 The Debugging Mindset
Production debugging is not guessing. It's a systematic narrowing process:
- Cluster level — Are the nodes healthy? Is the API server responding?
- Namespace level — Are pods running? Any recent events?
- Pod level — What's the pod status? What do the logs say?
- Container level — Can I exec into it? What's the process doing?
- Network level — Can pods reach each other? Is DNS working?
- Azure level — Is the underlying VM healthy? Any Azure outages?
💡
Golden Rule
Always start from the outside in. Check cluster health before looking at individual pods. A node problem will cause cascading pod failures — fixing the node fixes all the pods.
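The outside-in gate can be sketched in a few lines of shell. This is an illustrative sketch only: the heredoc-style sample stands in for a live `kubectl get nodes` call, and the node names are invented.

```bash
# Illustrative: sample output standing in for `kubectl get nodes` (names made up)
nodes='NAME                                STATUS     ROLES   AGE   VERSION
aks-nodepool1-12345678-vmss000000   Ready      agent   40d   v1.29.4
aks-nodepool1-12345678-vmss000001   NotReady   agent   40d   v1.29.4'

# Gate: stop at the cluster level if any node is unhealthy
bad_nodes=$(echo "$nodes" | awk 'NR > 1 && $2 != "Ready" {print $1}')
if [ -n "$bad_nodes" ]; then
  echo "Fix these nodes before debugging individual pods:"
  echo "$bad_nodes"
fi
```

On a live cluster, replace the sample variable with `nodes=$(kubectl get nodes)` and the same filter applies.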
🔧 Level 1 — Cluster Health
Is the API Server Responding?
```bash
# Quick health check — if this hangs, the control plane is down
kubectl cluster-info
# Kubernetes control plane is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443
# CoreDNS is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443/api/v1/...

# Check API server health directly
kubectl get --raw='/healthz'
# ok

# Check component status (deprecated but still useful)
kubectl get componentstatuses
```
Are Nodes Healthy?
```bash
# List all nodes with status
kubectl get nodes -o wide

# Check node conditions — look for MemoryPressure, DiskPressure, PIDPressure
kubectl describe nodes | grep -A15 "Conditions:"

# Quick check: are any nodes NotReady?
kubectl get nodes | grep -v " Ready "

# Check node resource usage
kubectl top nodes
# NAME                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# aks-nodepool1-xxxxx-vmss0000   245m         12%    1834Mi          48%
# aks-nodepool1-xxxxx-vmss0001   189m         9%     1562Mi          41%
```
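The Conditions block is what explains most cascading failures, so it is worth scanning programmatically. A minimal sketch, using a made-up fragment of `kubectl describe node` output in place of a live call:

```bash
# Illustrative: a trimmed Conditions fragment (values made up)
conditions='MemoryPressure   False
DiskPressure     True
PIDPressure      False
Ready            True'

# Any *Pressure condition that is True means the kubelet may already be
# evicting pods on that node
pressure=$(echo "$conditions" | awk '/Pressure/ && $2 == "True" {print $1}')
echo "$pressure"
```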
Azure-Level Cluster Check
```bash
# Check AKS cluster provisioning state
az aks show --resource-group $RG --name $CLUSTER \
  --query "{state:provisioningState,powerState:powerState.code,k8sVersion:kubernetesVersion}" -o table
# State      PowerState   K8sVersion
# Succeeded  Running      1.29.4

# Check for ongoing operations
az aks show --resource-group $RG --name $CLUSTER \
  --query "provisioningState" -o tsv

# Fetch credentials, then capture a cluster diagnostics bundle
az aks get-credentials --resource-group $RG --name $CLUSTER
az aks kollect --resource-group $RG --name $CLUSTER  # Captures a diagnostics bundle
```

🔧 Level 2 — Namespace Overview
```bash
# Get everything in a namespace at a glance
kubectl get all -n myapp

# Show recent events (MOST USEFUL COMMAND for debugging)
kubectl get events -n myapp --sort-by='.lastTimestamp' | tail -20

# Find pods NOT in Running state
kubectl get pods -n myapp --field-selector 'status.phase!=Running'

# Get resource quota usage (if quotas are set)
kubectl describe resourcequota -n myapp
```
❗
Events Are Your Best Friend
kubectl get events --sort-by='.lastTimestamp' is the single most useful debugging command. Events show scheduling failures, image pulls, volume mounts, liveness probe failures, and OOM kills — all in chronological order.
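Two details worth internalizing: ISO-8601 timestamps sort chronologically as plain strings, which is why sorting by `.lastTimestamp` puts the newest events at the bottom, and `--field-selector type=Warning` narrows the list to failures. A local sketch with invented sample events:

```bash
# Illustrative: sample events as "timestamp type reason" lines (values made up)
events='2024-05-01T10:02:11Z Warning BackOff
2024-05-01T10:05:42Z Normal Pulled
2024-05-01T09:58:03Z Warning FailedScheduling'

# Plain-string sort of ISO timestamps is chronological; newest ends up last,
# matching the ordering of --sort-by='.lastTimestamp'
recent=$(echo "$events" | sort | tail -1)
echo "$recent"

# Warnings only (live: kubectl get events --field-selector type=Warning)
warnings=$(echo "$events" | awk '$2 == "Warning"')
echo "$warnings"
```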
🔧 Level 3 — Pod Diagnostics
Pod Status Troubleshooting Matrix
| Status | What It Means | First Debug Command |
|---|---|---|
| Pending | Not yet scheduled to a node | kubectl describe pod <name> → check Events |
| ContainerCreating | Image pulling or volume mounting | kubectl describe pod <name> → check Events |
| ImagePullBackOff | Can't pull the container image | kubectl describe pod <name> → check image name & ACR auth |
| CrashLoopBackOff | Container starts then crashes | kubectl logs <name> --previous |
| Running but not Ready | Readiness probe failing | kubectl describe pod <name> → check probe config |
| Evicted | Node ran out of resources | kubectl describe pod <name> → check reason field |
| Terminating (stuck) | Finalizer or graceful shutdown hang | kubectl get pod <name> -o yaml \| grep finaliz |
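The first triage pass over a namespace is usually "which pods are not both Running and fully Ready?". A sketch of that filter, run against invented sample `kubectl get pods` output so it works without a cluster:

```bash
# Illustrative: sample `kubectl get pods` output (pod names made up)
pods='NAME         READY   STATUS             RESTARTS   AGE
web-abc      1/1     Running            0          2d
web-def      0/1     CrashLoopBackOff   12         2d
worker-xyz   0/1     ImagePullBackOff   0          5m'

# Surface anything not Running, or Running but not fully Ready; this mirrors
# kubectl get pods --field-selector 'status.phase!=Running' plus a READY check
problems=$(echo "$pods" | awk 'NR > 1 {
  split($2, r, "/")
  if ($3 != "Running" || r[1] != r[2]) print $1, $3
}')
echo "$problems"
```

Each pod that surfaces here then gets the matching first debug command from the matrix above.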
Deep Pod Inspection
```bash
# Full describe — shows events, conditions, volumes, env
kubectl describe pod my-pod -n myapp

# View container logs (current run)
kubectl logs my-pod -n myapp

# View previous container logs (after a crash)
kubectl logs my-pod -n myapp --previous

# Follow logs in real-time
kubectl logs my-pod -n myapp -f

# Logs from a specific container in a multi-container pod
kubectl logs my-pod -n myapp -c sidecar

# View logs from ALL pods with a label
kubectl logs -n myapp -l app=myapp --all-containers --tail=50
```
Exec Into a Running Container
```bash
# Get a shell inside the container
kubectl exec -it my-pod -n myapp -- /bin/sh

# Quick checks from inside the container:
# 1. Check environment variables
env | sort

# 2. Check DNS resolution
nslookup redis-service.myapp.svc.cluster.local

# 3. Check connectivity to another service
wget -qO- http://backend-service:8080/health

# 4. Check filesystem
df -h
ls -la /app/

# 5. Check running processes
ps aux
```
🔧 Level 4 — Network Debugging
DNS Debugging
```bash
# Deploy a debug pod with network tools
kubectl run netdebug --image=nicolaka/netshoot -it --rm --restart=Never -- bash

# From inside the debug pod:
# 1. Test DNS resolution
nslookup kubernetes.default.svc.cluster.local
nslookup myservice.myapp.svc.cluster.local

# 2. Test connectivity to a service
curl -v http://myservice.myapp.svc.cluster.local:80/

# 3. Test connectivity to external endpoints
curl -v https://mcr.microsoft.com

# 4. Check DNS config
cat /etc/resolv.conf
# nameserver 10.0.0.10 (CoreDNS ClusterIP)
# search myapp.svc.cluster.local svc.cluster.local cluster.local
```
Service & Endpoint Debugging
```bash
# Check if service has endpoints (i.e., backing pods exist)
kubectl get endpoints myservice -n myapp
# NAME        ENDPOINTS                         AGE
# myservice   10.244.1.5:3000,10.244.2.3:3000   1h

# If ENDPOINTS is empty, the selector doesn't match any pods.
# Compare the service selector with pod labels:
kubectl get svc myservice -n myapp -o jsonpath='{.spec.selector}'
# {"app":"myapp"}
kubectl get pods -n myapp -l app=myapp
# (should show matching pods)

# Check Ingress
kubectl get ingress -n myapp
kubectl describe ingress myingress -n myapp
```

🔧 Level 5 — Azure-Level Diagnostics
```bash
# Check AKS cluster resource health
az aks show --resource-group $RG --name $CLUSTER \
  --query "{fqdn:fqdn,state:provisioningState,power:powerState.code}" -o table

# List node pool details
az aks nodepool list --resource-group $RG --cluster-name $CLUSTER \
  --query "[].{name:name,count:count,vmSize:vmSize,mode:mode,state:provisioningState}" -o table

# Check for failed Azure operations on the MC_ resource group
MC_RG=$(az aks show --resource-group $RG --name $CLUSTER --query nodeResourceGroup -o tsv)
az monitor activity-log list --resource-group $MC_RG \
  --status Failed --offset 1h -o table

# Check Azure Load Balancer health (for Services of type LoadBalancer)
az network lb show --resource-group $MC_RG \
  --name kubernetes --query "frontendIPConfigurations[].{name:name,ip:privateIPAddress}" -o table

# Verify the cluster can authenticate to ACR
az aks check-acr --resource-group $RG --name $CLUSTER --acr myacr.azurecr.io

# Run kubectl inside the cluster via command invoke (useful for private clusters)
az aks command invoke --resource-group $RG --name $CLUSTER \
  --command "kubectl logs -n kube-system -l component=kube-proxy --tail=50"
```

🏭 Real-World Debugging Walkthrough
Scenario: "The App Deployed But Users Get 502 Bad Gateway"
```bash
# Step 1: Check if pods are running
kubectl get pods -n production -l app=web
# All 3 pods show Running and 1/1 Ready — so the app is up

# Step 2: Check the Ingress
kubectl describe ingress web-ingress -n production
# Shows backends: 10.244.1.5:3000, 10.244.2.3:3000, 10.244.3.7:3000

# Step 3: Check NGINX Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=20
# [error] upstream prematurely closed connection while reading response header

# Step 4: The app is failing during request processing — check app logs
kubectl logs -n production -l app=web --tail=20
# Error: connect ECONNREFUSED 10.0.0.5:5432 — can't reach the database!

# Step 5: Follow the database lead
kubectl get svc postgres -n production
# The service ClusterIP is 10.0.0.5 — but are there pods behind it?
kubectl get pods -n production -l app=postgres
# No resources found — the postgres pod is gone!
kubectl get events -n production --sort-by='.lastTimestamp' | grep postgres
# Evicted due to node disk pressure — that's the root cause

# Fix: Restart postgres and add resource requests to prevent future evictions
kubectl rollout restart deployment postgres -n production
```
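Why do resource requests prevent the eviction seen in the last step? Broadly, under node pressure the kubelet evicts BestEffort pods (no requests set) first, then Burstable pods exceeding their requests, with Guaranteed pods last. A local sketch of that ranking, using invented pod names and QoS classes:

```bash
# Illustrative: pods paired with their QoS class (names and classes made up)
pods='postgres BestEffort
web Burstable
ingress Guaranteed'

# Rank eviction risk: BestEffort first to go (0), Guaranteed last (2)
first_to_go=$(echo "$pods" | awk '
  $2 == "BestEffort" {print 0, $1}
  $2 == "Burstable"  {print 1, $1}
  $2 == "Guaranteed" {print 2, $1}' | sort -n | head -1 | cut -d" " -f2)
echo "$first_to_go"
```

On the live cluster the fix could be `kubectl set resources deployment postgres -n production --requests=cpu=250m,memory=512Mi` (the values are placeholders, to be sized from `kubectl top pod`).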
📋 Diagnostic Commands Cheat Sheet
| What | Command |
|---|---|
| Cluster health | kubectl cluster-info |
| Node status | kubectl get nodes -o wide |
| Node resource usage | kubectl top nodes |
| All resources in namespace | kubectl get all -n <ns> |
| Recent events | kubectl get events --sort-by='.lastTimestamp' -n <ns> |
| Pod details | kubectl describe pod <name> -n <ns> |
| Current logs | kubectl logs <pod> -n <ns> |
| Previous crash logs | kubectl logs <pod> -n <ns> --previous |
| Shell into container | kubectl exec -it <pod> -n <ns> -- /bin/sh |
| DNS test | kubectl run test --image=busybox --rm -it -- nslookup <svc> |
| Service endpoints | kubectl get endpoints <svc> -n <ns> |
| AKS state | az aks show -g <rg> -n <aks> --query provisioningState |
| ACR auth check | az aks check-acr -g <rg> -n <aks> --acr <acr>.azurecr.io |
| Azure activity log | az monitor activity-log list -g <mc_rg> --status Failed |