Golden Rule
Always start from the outside in. Check cluster health before looking at individual pods. A node problem causes cascading pod failures, so fixing the node usually fixes every pod running on it.
A systematic debugging toolkit for AKS. Learn the exact commands, order of operations, and mental models that SREs use to diagnose production issues on Azure Kubernetes Service.
Production debugging is not guessing. It's a systematic narrowing process:
# Quick health check: if this hangs, the control plane is down
kubectl cluster-info
# Kubernetes control plane is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443
# CoreDNS is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443/api/v1/...

# Check API server health directly
kubectl get --raw='/healthz'
# ok

# Check component status (deprecated but still useful)
kubectl get componentstatuses
# List all nodes with status
kubectl get nodes -o wide

# Check node conditions: look for MemoryPressure, DiskPressure, PIDPressure
kubectl describe nodes | grep -A15 "Conditions:"

# Quick check: are any nodes NotReady?
kubectl get nodes | grep -v " Ready "

# Check node resource usage
kubectl top nodes
# NAME                          CPU(cores)  CPU%  MEMORY(bytes)  MEMORY%
# aks-nodepool1-xxxxx-vmss0000  245m        12%   1834Mi         48%
# aks-nodepool1-xxxxx-vmss0001  189m        9%    1562Mi         41%
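The `grep -v " Ready "` check also prints the header row and depends on column spacing. A minimal sketch of a stricter filter follows, assuming the default `kubectl get nodes --no-headers` column order (NAME first, STATUS second). The function name `unhealthy_nodes` is hypothetical, not part of kubectl.

```shell
# unhealthy_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print NAME and STATUS for every node whose STATUS is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are reported as well.
# Assumption: STATUS is the second whitespace-separated column.
unhealthy_nodes() {
  awk '$2 != "Ready" { printf "%s\t%s\n", $1, $2 }'
}

# Usage (assumes kubectl already points at the cluster):
#   kubectl get nodes --no-headers | unhealthy_nodes
```

Empty output means every node reports `Ready`, which makes the function easy to use in a script condition.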
# Check AKS cluster provisioning state
az aks show --resource-group $RG --name $CLUSTER \
--query "{state:provisioningState,powerState:powerState.code,k8sVersion:kubernetesVersion}" -o table
# State PowerState K8sVersion
# Succeeded Running 1.29.4
# Check for ongoing operations
az aks show --resource-group $RG --name $CLUSTER \
--query "provisioningState" -o tsv
# Get cluster diagnostic logs
az aks get-credentials --resource-group $RG --name $CLUSTER
az aks kollect --resource-group $RG --name $CLUSTER # Captures diagnostics bundle

# Get everything in a namespace at a glance
kubectl get all -n myapp

# Show recent events (MOST USEFUL COMMAND for debugging)
kubectl get events -n myapp --sort-by='.lastTimestamp' | tail -20

# Find pods NOT in Running state
# (pods in CrashLoopBackOff still report phase Running, so also scan the STATUS column)
kubectl get pods -n myapp --field-selector 'status.phase!=Running'

# Get resource quota usage (if quotas are set)
kubectl describe resourcequota -n myapp
kubectl get events --sort-by='.lastTimestamp' is the single most useful debugging command. Events show scheduling failures, image pulls, volume mounts, liveness probe failures, and OOM kills, all in chronological order. Kubernetes keeps events for only about an hour by default, so collect them soon after an incident.
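When a namespace is noisy, counting events by reason can show the dominant failure at a glance. The following is a minimal sketch, assuming one event reason per input line; the function name `top_event_reasons` is hypothetical.

```shell
# top_event_reasons: read one event reason per line on stdin and print
# "<count> <reason>", sorted by count (highest first), then by reason name.
top_event_reasons() {
  awk 'NF { count[$1]++ } END { for (r in count) print count[r], r }' |
    sort -k1,1nr -k2,2
}

# Usage (assumes the "myapp" namespace used elsewhere in this guide):
#   kubectl get events -n myapp --field-selector type=Warning \
#     --no-headers -o custom-columns=REASON:.reason | top_event_reasons
```

Filtering on `type=Warning` keeps routine `Normal` events such as `Pulled` and `Started` out of the summary.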
| Status | What It Means | First Debug Command |
|---|---|---|
| Pending | Not yet scheduled to a node | kubectl describe pod <name> → check Events |
| ContainerCreating | Image pulling or volume mounting | kubectl describe pod <name> → check Events |
| ImagePullBackOff | Can't pull the container image | kubectl describe pod <name> → check image name & ACR auth |
| CrashLoopBackOff | Container starts then crashes | kubectl logs <name> --previous |
| Running but not Ready | Readiness probe failing | kubectl describe pod <name> → check probe config |
| Evicted | Node ran out of resources | kubectl describe pod <name> → check reason field |
| Terminating (stuck) | Finalizer or graceful shutdown hang | kubectl get pod <name> -o yaml \| grep finaliz |
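The table can be encoded as a small lookup so that a runbook script suggests the first command for a given status. This is a sketch only: the function name `first_debug_command` and its argument order are hypothetical, and it prints the command rather than running it.

```shell
# first_debug_command STATUS POD NAMESPACE
# Print (do not run) the first command to try for a pod status, following
# the status table. Every status other than CrashLoopBackOff and Terminating
# maps to `kubectl describe pod`, as in the table.
first_debug_command() {
  status=$1
  pod=$2
  ns=$3
  case "$status" in
    CrashLoopBackOff)
      echo "kubectl logs $pod -n $ns --previous" ;;
    Terminating)
      echo "kubectl get pod $pod -n $ns -o yaml | grep finaliz" ;;
    *)
      echo "kubectl describe pod $pod -n $ns" ;;
  esac
}

# Usage:
#   first_debug_command CrashLoopBackOff my-pod myapp
```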
# Full describe: shows events, conditions, volumes, env
kubectl describe pod my-pod -n myapp

# View container logs (current run)
kubectl logs my-pod -n myapp

# View previous container logs (after a crash)
kubectl logs my-pod -n myapp --previous

# Follow logs in real-time
kubectl logs my-pod -n myapp -f

# Logs from a specific container in a multi-container pod
kubectl logs my-pod -n myapp -c sidecar

# View logs from ALL pods with a label
kubectl logs -n myapp -l app=myapp --all-containers --tail=50
# Get a shell inside the container
kubectl exec -it my-pod -n myapp -- /bin/sh

# Quick checks from inside the container:
# 1. Check environment variables
env | sort

# 2. Check DNS resolution
nslookup redis-service.myapp.svc.cluster.local

# 3. Check connectivity to another service
wget -qO- http://backend-service:8080/health

# 4. Check filesystem
df -h
ls -la /app/

# 5. Check running processes
ps aux
# Deploy a debug pod with network tools
kubectl run netdebug --image=nicolaka/netshoot -it --rm --restart=Never -- bash

# From inside the debug pod:
# 1. Test DNS resolution
nslookup kubernetes.default.svc.cluster.local
nslookup myservice.myapp.svc.cluster.local

# 2. Test connectivity to a service
curl -v http://myservice.myapp.svc.cluster.local:80/

# 3. Test connectivity to external endpoints
curl -v https://mcr.microsoft.com

# 4. Check DNS config
cat /etc/resolv.conf
# nameserver 10.0.0.10 (CoreDNS ClusterIP)
# search myapp.svc.cluster.local svc.cluster.local cluster.local
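The `search` line in `/etc/resolv.conf` is why a short name such as `backend-service` resolves from a pod in the same namespace. The sketch below lists the fully qualified names that result from appending each search domain. It is a simplification: it ignores the `ndots` option and any other resolver settings, and the function name `dns_candidates` is hypothetical.

```shell
# dns_candidates NAME: read a resolv.conf on stdin and print NAME with each
# search domain appended, in the order the domains are listed.
# Simplification: the `ndots` option is not taken into account.
dns_candidates() {
  name=$1
  awk -v name="$name" '
    $1 == "search" { for (i = 2; i <= NF; i++) print name "." $i }
  '
}

# Usage from inside a pod:
#   dns_candidates backend-service < /etc/resolv.conf
```

If a short name fails while the fully qualified name works, comparing the candidates against the Service's actual namespace usually shows whether the pod is simply in a different namespace.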
# Check if service has endpoints (i.e., backing pods exist)
kubectl get endpoints myservice -n myapp
# NAME ENDPOINTS AGE
# myservice 10.244.1.5:3000,10.244.2.3:3000 1h
# If ENDPOINTS is empty, the selector doesn't match any pods
# Compare service selector with pod labels:
kubectl get svc myservice -n myapp -o jsonpath='{.spec.selector}'
# {"app":"myapp"}
kubectl get pods -n myapp -l app=myapp
# (should show matching pods)
# Check Ingress
kubectl get ingress -n myapp
kubectl describe ingress myingress -n myapp

# Check AKS cluster resource health
az aks show --resource-group $RG --name $CLUSTER \
--query "{fqdn:fqdn,state:provisioningState,power:powerState.code}" -o table
# List node pool details
az aks nodepool list --resource-group $RG --cluster-name $CLUSTER \
--query "[].{name:name,count:count,vmSize:vmSize,mode:mode,state:provisioningState}" -o table
# Check for failed Azure operations on the MC_ resource group
MC_RG=$(az aks show --resource-group $RG --name $CLUSTER --query nodeResourceGroup -o tsv)
az monitor activity-log list --resource-group $MC_RG \
--status Failed --offset 1h -o table
# Check Azure Load Balancer health (for Service type LoadBalancer)
az network lb show --resource-group $MC_RG \
--name kubernetes --query "frontendIPConfigurations[].{name:name,ip:privateIPAddress}" -o table
# Verify that the cluster can reach and authenticate to the container registry
az aks check-acr --resource-group $RG --name $CLUSTER --acr myacr.azurecr.io
# Run kubectl through the Azure CLI (useful when the API server is not directly
# reachable, e.g. a private cluster). This example fetches kube-proxy logs.
az aks command invoke --resource-group $RG --name $CLUSTER \
  --command "kubectl logs -n kube-system -l component=kube-proxy --tail=50"

# Step 1: Check if pods are running
kubectl get pods -n production -l app=web
# All 3 pods show Running and 1/1 Ready, so the app is up

# Step 2: Check the Ingress
kubectl describe ingress web-ingress -n production
# Shows backends: 10.244.1.5:3000, 10.244.2.3:3000, 10.244.3.7:3000

# Step 3: Check NGINX Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=20
# [error] upstream prematurely closed connection while reading response header

# Step 4: The app is failing during request processing, so check app logs
kubectl logs -n production -l app=web --tail=20
# Error: connect ECONNREFUSED 10.0.0.5:5432 (can't reach the database!)

# Step 5: Check the database Service and the pods behind it
kubectl get svc postgres -n production
# The service ClusterIP is 10.0.0.5, so the connection string is correct
kubectl get pods -n production -l app=postgres
# No resources found: the Service has no backing pod
kubectl get events -n production --sort-by='.lastTimestamp' | grep postgres
# Evicted due to node disk pressure

# Root cause: the postgres pod was evicted, leaving the Service without endpoints.
# Fix: Restart postgres and add resource requests to reduce the risk of eviction
kubectl rollout restart deployment postgres -n production
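The walkthrough ends at a Service with no backing pods, a condition that can be checked for every Service in a namespace at once. This is a minimal sketch, assuming kubectl prints `<none>` in the ENDPOINTS column for a Service without ready pods; the function name `services_without_endpoints` is hypothetical.

```shell
# services_without_endpoints: read `kubectl get endpoints --no-headers`
# output on stdin and print the name of each entry with no endpoints.
# Assumption: kubectl shows "<none>" in the second column in that case.
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Usage (namespace taken from the walkthrough):
#   kubectl get endpoints -n production --no-headers | services_without_endpoints
```

Running this right after the Step 1 pod check would have pointed at `postgres` before any logs were read.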
| What | Command |
|---|---|
| Cluster health | kubectl cluster-info |
| Node status | kubectl get nodes -o wide |
| Node resource usage | kubectl top nodes |
| All resources in namespace | kubectl get all -n <ns> |
| Recent events | kubectl get events --sort-by='.lastTimestamp' -n <ns> |
| Pod details | kubectl describe pod <name> -n <ns> |
| Current logs | kubectl logs <pod> -n <ns> |
| Previous crash logs | kubectl logs <pod> -n <ns> --previous |
| Shell into container | kubectl exec -it <pod> -n <ns> -- /bin/sh |
| DNS test | kubectl run test --image=busybox --rm -it -- nslookup <svc> |
| Service endpoints | kubectl get endpoints <svc> -n <ns> |
| AKS state | az aks show -g <rg> -n <aks> --query provisioningState |
| ACR auth check | az aks check-acr -g <rg> -n <aks> --acr <acr>.azurecr.io |
| Azure activity log | az monitor activity-log list -g <mc_rg> --status Failed |
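The outside-in order of the first rows in this table can be captured as a dry-run helper that prints the commands in sequence without running them. This is a sketch under assumptions: the function name `triage_plan` is hypothetical, and the selection of five commands is one reasonable reading of the cluster, node, namespace progression, not a fixed procedure.

```shell
# triage_plan NAMESPACE: print the outside-in command sequence, one command
# per line, moving from cluster to nodes to the namespace. Nothing is run.
triage_plan() {
  ns=${1:?usage: triage_plan NAMESPACE}
  cat <<EOF
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes
kubectl get all -n $ns
kubectl get events --sort-by='.lastTimestamp' -n $ns
EOF
}

# Usage: review the plan first, then run the commands one at a time.
#   triage_plan myapp
```

Printing instead of executing keeps the helper safe to paste into an incident channel, where each command's output can be reviewed before moving to the next layer.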