Break & Fix — AKS Failure Scenarios
Intentionally break your AKS cluster in controlled ways, then diagnose and fix each issue. This is the fastest way to build real troubleshooting muscle.
🎯 How This Lab Works
Each scenario follows the same pattern:
- Break it — We give you a command that introduces a specific failure
- Observe it — See what the symptoms look like (the same symptoms you'd see in production)
- Diagnose it — Use kubectl, az CLI, and logs to find the root cause
- Fix it — Apply the correct fix and verify recovery
Run these exercises on a disposable lab cluster, not a production environment. Some scenarios involve deleting resources, corrupting configs, and draining nodes.
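Before running anything, a cheap guard against pointing kubectl at the wrong cluster is worth having. A minimal sketch, assuming your production contexts contain "prod" in their names (the pattern and function name are made up; adapt them to your own naming convention):

```shell
# Guard: refuse to proceed if the kubectl context name looks
# production-like. The "prod" substring check is an assumption.
is_lab_context() {
  case "$1" in
    *prod*) return 1 ;;  # looks like production: refuse
    *)      return 0 ;;  # anything else: allow
  esac
}

# Usage (requires a live kubeconfig):
#   is_lab_context "$(kubectl config current-context)" || exit 1
```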
💥 Scenario 1 — ImagePullBackOff
The single most common AKS deployment failure: the pod can't pull the container image.
Break It
```shell
# Deploy a pod with a typo in the image name
kubectl create namespace break-fix
kubectl run broken-app --namespace break-fix \
  --image=myacr.azurecr.io/does-not-exist:v1
```
Observe It
```shell
kubectl get pods -n break-fix -w
# NAME         READY   STATUS             RESTARTS   AGE
# broken-app   0/1     ImagePullBackOff   0          30s

kubectl describe pod broken-app -n break-fix | grep -A5 "Events:"
# Events:
#   Warning  Failed  kubelet  Failed to pull image "myacr.azurecr.io/does-not-exist:v1":
#     rpc error: code = NotFound desc = failed to pull and unpack image:
#     failed to resolve reference: myacr.azurecr.io/does-not-exist:v1: not found
```
Diagnose It
Three common causes of ImagePullBackOff on AKS:
| Cause | How to Check | Fix |
|---|---|---|
| Wrong image name/tag | az acr repository list --name myacr | Fix the image reference in the deployment |
| AKS can't auth to ACR | az aks check-acr --resource-group rg --name aks --acr myacr.azurecr.io | az aks update --attach-acr myacr |
| ACR is in a different subscription/tenant | Check role assignments on ACR | Create a Kubernetes pull secret instead |
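For the cross-subscription/tenant case, the pull secret is created with `kubectl create secret docker-registry`. Below is a sketch that composes the command for review before running it; `acr-pull`, `myacr`, and `$ACR_PASSWORD` are placeholders, and the real password would come from `az acr credential show` or a service principal:

```shell
# Compose the pull-secret command (placeholders: acr-pull, myacr).
# \$ACR_PASSWORD is left unexpanded so the command can be reviewed
# before real credentials are substituted in.
CMD="kubectl create secret docker-registry acr-pull \
  --docker-server=myacr.azurecr.io \
  --docker-username=myacr \
  --docker-password=\$ACR_PASSWORD \
  --namespace break-fix"
echo "$CMD"

# The pod spec then references the secret:
#   spec:
#     imagePullSecrets:
#       - name: acr-pull
```

Review the printed command, substitute real credentials, and run it against the lab cluster.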
Fix It
```shell
# Fix 1: Correct the image reference
kubectl set image pod/broken-app broken-app=myacr.azurecr.io/real-app:v1 -n break-fix

# Fix 2: If it's an auth issue, re-attach ACR
az aks update \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --attach-acr $ACR_NAME

# Verify fix
kubectl get pods -n break-fix
# NAME         READY   STATUS    RESTARTS   AGE
# broken-app   1/1     Running   0          10s
```
💥 Scenario 2 — CrashLoopBackOff
The container starts but immediately crashes, over and over.
Break It
```shell
# Deploy a pod with a bad command that exits immediately
kubectl run crash-app --namespace break-fix \
  --image=alpine \
  --command -- sh -c "echo 'starting...'; exit 1"
```
Observe It
```shell
kubectl get pods -n break-fix -w
# NAME        READY   STATUS             RESTARTS      AGE
# crash-app   0/1     CrashLoopBackOff   3 (30s ago)   2m

# Check previous container logs
kubectl logs crash-app -n break-fix --previous
# starting...

# Describe shows the exit code
kubectl describe pod crash-app -n break-fix | grep -A3 "Last State"
# Last State:  Terminated
#   Reason:    Error
#   Exit Code: 1
```
Key Diagnostic Commands
| Exit Code | Meaning | Common Cause |
|---|---|---|
| 0 | Success (but container should keep running) | Missing CMD or process exits cleanly |
| 1 | Application error | Missing env var, bad config, unhandled exception |
| 137 | OOMKilled | Container exceeded memory limits |
| 139 | Segfault | Native code crash, corrupted binary |
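Exit codes above 128 follow the 128 + signal number convention: 137 = 128 + 9 (SIGKILL, which is what the OOM killer delivers) and 139 = 128 + 11 (SIGSEGV). A small decoder sketch mirroring the table (the function name is made up):

```shell
# Decode a container exit code into the categories from the table above.
decode_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    137) echo "OOMKilled/SIGKILL (128+9)" ;;
    139) echo "segfault/SIGSEGV (128+11)" ;;
    *)   if [ "$1" -gt 128 ]; then
           # Any code > 128 means the process died from a signal
           echo "killed by signal $(( $1 - 128 ))"
         else
           echo "application error ($1)"
         fi ;;
  esac
}
```

Usage: `decode_exit 143` reports "killed by signal 15" (SIGTERM), which typically means the container was stopped during pod termination rather than crashing on its own.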
Fix It
```shell
# Delete the broken pod and deploy correctly
kubectl delete pod crash-app -n break-fix
kubectl run crash-app --namespace break-fix \
  --image=alpine \
  --command -- sh -c "echo 'healthy'; sleep infinity"

# If it was OOMKilled (exit 137), delete and recreate the pod with a
# higher memory limit (e.g. resources.limits.memory: 512Mi); pod
# resource fields can't be edited in place.
```
💥 Scenario 3 — Node NotReady
A worker node becomes NotReady — pods get evicted, scheduling fails.
Break It
```shell
# Cordon and drain a node to simulate it going NotReady
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl cordon $NODE
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --force
```
Observe It
```shell
kubectl get nodes
# NAME                           STATUS                     ROLES   AGE
# aks-nodepool1-xxxxx-vmss0000   Ready,SchedulingDisabled   agent   1h
# aks-nodepool1-xxxxx-vmss0001   Ready                      agent   1h

# Pods that were on the drained node are now Pending or rescheduled
kubectl get pods -A -o wide | grep -v Running
```
Diagnose It
```shell
# Check node conditions
kubectl describe node $NODE | grep -A10 "Conditions:"

# Check for real NotReady scenarios
az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
  --query "agentPoolProfiles[].{name:name,count:count,powerState:powerState}" -o table

# Check VMSS instance health in Azure
az vmss list-instances --resource-group MC_${RESOURCE_GROUP}_${CLUSTER_NAME}_${LOCATION} \
  --name $(az vmss list --resource-group MC_${RESOURCE_GROUP}_${CLUSTER_NAME}_${LOCATION} -o tsv --query "[0].name") \
  --query "[].{id:instanceId,state:provisioningState}" -o table
```
Fix It
```shell
# Uncordon the node (allow scheduling again)
kubectl uncordon $NODE

# Verify it's Ready
kubectl get nodes
# NAME                           STATUS   ROLES   AGE
# aks-nodepool1-xxxxx-vmss0000   Ready    agent   1h
# aks-nodepool1-xxxxx-vmss0001   Ready    agent   1h

# If a node is truly dead, let AKS replace it:
# az aks nodepool scale --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME \
#   --name nodepool1 --node-count 2
```
💥 Scenario 4 — Service Not Reachable (DNS Failure)
The app deploys fine but can't connect to other services by name.
Break It
```shell
# Scale CoreDNS to 0 replicas (kills internal DNS)
kubectl scale deployment coredns -n kube-system --replicas=0
```
Observe It
```shell
# Try DNS resolution from inside a pod
kubectl run dns-test --namespace break-fix --image=busybox --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local
# ;; connection timed out; no servers could be reached

# Your app logs will show connection errors like:
# Error: getaddrinfo ENOTFOUND redis
```
Diagnose It
```shell
# Check if CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# No resources found in kube-system namespace.  <-- That's the problem!

# Check CoreDNS deployment
kubectl get deployment coredns -n kube-system
# NAME      READY   UP-TO-DATE   AVAILABLE   AGE
# coredns   0/0     0            0           1h
```
Fix It
```shell
# Restore CoreDNS
kubectl scale deployment coredns -n kube-system --replicas=2

# Verify DNS works again
kubectl run dns-test2 --namespace break-fix --image=busybox --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local
# Name:    kubernetes.default.svc.cluster.local
# Address: 10.0.0.1
```
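The names resolved in this scenario follow the Kubernetes service DNS convention `<service>.<namespace>.svc.<cluster-domain>`, with `cluster.local` as the default cluster domain. A tiny helper for composing such names (hypothetical, for illustration):

```shell
# Build the fully qualified DNS name of a Service.
# $1 = service name, $2 = namespace, $3 = cluster domain
# (defaults to cluster.local if omitted).
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}
```

For example, `svc_fqdn kubernetes default` yields `kubernetes.default.svc.cluster.local`; the short name `redis` in the earlier error only resolves from pods in the same namespace as the `redis` Service.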
💥 Scenario 5 — RBAC Denied
A user or service account gets Forbidden errors when trying to access resources.
Break It
```shell
# Create a service account with NO permissions
kubectl create serviceaccount restricted-sa -n break-fix

# Try to list pods using that service account
kubectl auth can-i list pods -n break-fix --as=system:serviceaccount:break-fix:restricted-sa
# no
```
Diagnose It
```shell
# Check what the SA can do
kubectl auth can-i --list --as=system:serviceaccount:break-fix:restricted-sa -n break-fix
# Resources   Verbs
# --------    -----
# (nothing meaningful listed)

# Check existing role bindings
kubectl get rolebindings -n break-fix
kubectl get clusterrolebindings | grep break-fix
```
Fix It
```shell
# Create a Role that allows reading pods
kubectl create role pod-reader -n break-fix \
  --verb=get,list,watch \
  --resource=pods

# Bind it to the service account
kubectl create rolebinding pod-reader-binding -n break-fix \
  --role=pod-reader \
  --serviceaccount=break-fix:restricted-sa

# Verify
kubectl auth can-i list pods -n break-fix --as=system:serviceaccount:break-fix:restricted-sa
# yes
```
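The long `--as` strings in this scenario follow a fixed format: a service account is impersonated as `system:serviceaccount:<namespace>:<name>`. A helper to avoid typos in that subject string (the function name is an invention for this sketch):

```shell
# Build the impersonation subject for a service account,
# e.g. kubectl auth can-i list pods --as="$(sa_subject break-fix restricted-sa)"
sa_subject() {
  echo "system:serviceaccount:$1:$2"
}
```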
💥 Scenario 6 — Pending Pods (Insufficient Resources)
Break It
```shell
# Request more CPU than any node has available
kubectl run greedy-pod --namespace break-fix \
  --image=nginx \
  --overrides='{"spec":{"containers":[{"name":"greedy","image":"nginx","resources":{"requests":{"cpu":"64","memory":"128Gi"}}}]}}'
```
Observe & Diagnose
```shell
kubectl get pods -n break-fix
# NAME         READY   STATUS    AGE
# greedy-pod   0/1     Pending   1m

kubectl describe pod greedy-pod -n break-fix | grep -A3 "Events:"
# Events:
#   Warning  FailedScheduling  default-scheduler
#   0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.

# Check actual node capacity
kubectl describe nodes | grep -A5 "Allocatable:"
```
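The scheduler rejects the pod because `"cpu":"64"` asks for 64 whole cores, i.e. 64000 millicores; Kubernetes CPU quantities are either core counts or millicore values with an `m` suffix. A converter sketch for comparing requests against node allocatable (integer quantities only; fractional forms like `0.5` are out of scope here):

```shell
# Convert a Kubernetes CPU quantity to millicores.
# "100m" -> 100, "64" -> 64000.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;          # already millicores: strip suffix
    *)  echo $(( $1 * 1000 )) ;;  # whole cores: multiply by 1000
  esac
}
```

Comparing `cpu_to_millicores 64` against a typical lab node's allocatable (a few thousand millicores) makes the "Insufficient cpu" event unsurprising.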
Fix It
```shell
# Option 1: Reduce the resource request
kubectl delete pod greedy-pod -n break-fix
kubectl run greedy-pod --namespace break-fix --image=nginx \
  --overrides='{"spec":{"containers":[{"name":"greedy","image":"nginx","resources":{"requests":{"cpu":"100m","memory":"128Mi"}}}]}}'

# Option 2: Scale up the node pool
az aks nodepool scale \
  --resource-group $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME \
  --name nodepool1 \
  --node-count 3
```
🧹 Cleanup
```shell
kubectl delete namespace break-fix

# Make sure CoreDNS is scaled back up if you broke it:
kubectl scale deployment coredns -n kube-system --replicas=2
```
📋 Break & Fix Quick Reference
| Symptom | First Command | Likely Cause |
|---|---|---|
| ImagePullBackOff | kubectl describe pod | Wrong image, ACR auth, or image doesn't exist |
| CrashLoopBackOff | kubectl logs --previous | App crash, missing env var, OOMKilled |
| Pending | kubectl describe pod (Events) | Insufficient CPU/memory, node affinity mismatch |
| Node NotReady | kubectl describe node | VM issue, kubelet crash, disk pressure |
| DNS failures | kubectl get pods -n kube-system | CoreDNS down or misconfigured |
| 403 Forbidden | kubectl auth can-i --list | Missing RBAC role/binding |