Hands-on Lesson 12 of 14

Break & Fix — AKS Failure Scenarios

Intentionally break your AKS cluster in controlled ways, then diagnose and fix each issue. This is the fastest way to build real troubleshooting muscle.

🎯 How This Lab Works

Each scenario follows the same pattern:

  1. Break it — We give you a command that introduces a specific failure
  2. Observe it — See what the symptoms look like (the same symptoms you'd see in production)
  3. Diagnose it — Use kubectl, az CLI, and logs to find the root cause
  4. Fix it — Apply the correct fix and verify recovery
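Steps 2 and 3 boil down to the same few kubectl commands every time. You can bundle them in a throwaway helper (the `triage` function below is our own convenience, not a kubectl feature):

```shell
# triage <pod> [namespace] -- our own first-look wrapper bundling the
# standard diagnosis trio: status, recent events, and logs from the
# current and (if it crashed) previous container.
triage() {
  local pod="$1" ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" -o wide
  kubectl describe pod "$pod" -n "$ns" | grep -A10 "Events:"
  kubectl logs "$pod" -n "$ns" --tail=20 || true
  kubectl logs "$pod" -n "$ns" --previous --tail=20 2>/dev/null || true
}
# Usage in the scenarios below, e.g.: triage broken-app break-fix
```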

⚠️ Use a Lab Cluster

Run these exercises on a disposable lab cluster, not a production environment. Some scenarios involve deleting resources, corrupting configs, and draining nodes.
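If you don't already have one, a minimal throwaway cluster can be created along these lines (resource names, region, and VM size are placeholders; adjust for your subscription):

```shell
# Spin up a small, disposable two-node cluster for this lab.
# All names, the region, and the VM size below are placeholders.
RESOURCE_GROUP=break-fix-lab
CLUSTER_NAME=aks-break-fix
LOCATION=eastus

create_lab_cluster() {
  az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
  az aks create \
    --resource-group "$RESOURCE_GROUP" \
    --name "$CLUSTER_NAME" \
    --node-count 2 \
    --node-vm-size Standard_B2s \
    --generate-ssh-keys
  az aks get-credentials --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
}
# create_lab_cluster

# Tear everything down in one shot when you're finished:
# az group delete --name "$RESOURCE_GROUP" --yes --no-wait
```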

💥 Scenario 1 — ImagePullBackOff

The most common AKS deployment failure: the pod can't pull its container image.

Break It

bash
# Deploy a pod with a typo in the image name
kubectl create namespace break-fix
kubectl run broken-app --namespace break-fix \
  --image=myacr.azurecr.io/does-not-exist:v1

Observe It

bash
kubectl get pods -n break-fix -w
# NAME         READY   STATUS             RESTARTS   AGE
# broken-app   0/1     ImagePullBackOff   0          30s

kubectl describe pod broken-app -n break-fix | grep -A5 "Events:"
# Events:
#   Warning  Failed   kubelet  Failed to pull image "myacr.azurecr.io/does-not-exist:v1":
#            rpc error: code = NotFound desc = failed to pull and unpack image:
#            failed to resolve reference: myacr.azurecr.io/does-not-exist:v1: not found

Diagnose It

Three common causes of ImagePullBackOff on AKS:

Cause                                     | How to Check                                                            | Fix
------------------------------------------|-------------------------------------------------------------------------|-------------------------------------------
Wrong image name/tag                      | az acr repository list --name myacr                                     | Fix the image reference in the deployment
AKS can't auth to ACR                     | az aks check-acr --resource-group rg --name aks --acr myacr.azurecr.io | az aks update --attach-acr myacr
ACR is in a different subscription/tenant | Check role assignments on ACR                                           | Create a Kubernetes pull secret instead
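For the third cause, the pull secret might be created along these lines (a sketch; the secret name acr-pull and the service-principal credentials are placeholders):

```shell
# Cross-tenant pull: create an image pull secret from ACR service-principal
# credentials and reference it from the pod spec. Credentials are passed
# in as arguments; all names here are placeholders.
create_acr_pull_secret() {
  local sp_app_id="$1" sp_password="$2"
  kubectl create secret docker-registry acr-pull \
    --namespace break-fix \
    --docker-server=myacr.azurecr.io \
    --docker-username="$sp_app_id" \
    --docker-password="$sp_password"
}
# Usage: create_acr_pull_secret <sp-app-id> <sp-password>
# Then reference it in the pod spec:
#   spec:
#     imagePullSecrets:
#       - name: acr-pull
```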

Fix It

bash
# Fix 1: Correct the image reference
kubectl set image pod/broken-app broken-app=myacr.azurecr.io/real-app:v1 -n break-fix

# Fix 2: If it's an auth issue, re-attach ACR
az aks update \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --attach-acr $ACR_NAME

# Verify fix
kubectl get pods -n break-fix
# NAME         READY   STATUS    RESTARTS   AGE
# broken-app   1/1     Running   0          10s

💥 Scenario 2 — CrashLoopBackOff

The container starts but immediately crashes, over and over.

Break It

bash
# Deploy a pod with a bad command that exits immediately
kubectl run crash-app --namespace break-fix \
  --image=alpine \
  --command -- sh -c "echo 'starting...'; exit 1"

Observe It

bash
kubectl get pods -n break-fix -w
# NAME        READY   STATUS             RESTARTS      AGE
# crash-app   0/1     CrashLoopBackOff   3 (30s ago)   2m

# Check previous container logs
kubectl logs crash-app -n break-fix --previous
# starting...

# Describe shows the exit code
kubectl describe pod crash-app -n break-fix | grep -A3 "Last State"
#     Last State:  Terminated
#       Reason:    Error
#       Exit Code: 1

Key Diagnostic Commands

Exit Code | Meaning                                     | Common Cause
----------|---------------------------------------------|--------------------------------------------------
0         | Success (but container should keep running) | Missing CMD or process exits cleanly
1         | Application error                           | Missing env var, bad config, unhandled exception
137       | OOMKilled (SIGKILL)                         | Container exceeded memory limits
139       | Segfault (SIGSEGV)                          | Native code crash, corrupted binary
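The table above can be encoded as a tiny helper (our own convenience function) for reading the Exit Code field out of kubectl describe output:

```shell
# decode_exit <code> -- map a container exit code to its usual meaning.
# Covers the table above; any code above 128 means "killed by signal N-128".
decode_exit() {
  case "$1" in
    0)   echo "clean exit (container was expected to keep running)" ;;
    1)   echo "application error (bad config, missing env var, exception)" ;;
    137) echo "SIGKILL, usually OOMKilled (check memory limits)" ;;
    139) echo "SIGSEGV, segfault (native crash, corrupted binary)" ;;
    *)   if [ "$1" -gt 128 ] 2>/dev/null; then
           echo "killed by signal $(($1 - 128))"
         else
           echo "unknown exit code $1"
         fi ;;
  esac
}

decode_exit 137   # SIGKILL, usually OOMKilled (check memory limits)
```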

Fix It

bash
# Delete the broken pod and deploy correctly
kubectl delete pod crash-app -n break-fix
kubectl run crash-app --namespace break-fix \
  --image=alpine \
  --command -- sh -c "echo 'healthy'; sleep infinity"

# If it was OOMKilled (exit 137), raise the memory limit. Note that pod
# resources are immutable, so delete and recreate the pod with a higher
# limit; for a Deployment you could instead run:
# kubectl set resources deployment/<name> -n break-fix --limits=memory=512Mi

💥 Scenario 3 — Node NotReady

A worker node becomes NotReady — pods get evicted, scheduling fails.

Break It

bash
# Cordon and drain a node to simulate it going NotReady
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl cordon $NODE
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --force

Observe It

bash
kubectl get nodes
# NAME                              STATUS                     ROLES   AGE
# aks-nodepool1-xxxxx-vmss0000      Ready,SchedulingDisabled   agent   1h
# aks-nodepool1-xxxxx-vmss0001      Ready                      agent   1h

# Pods that were on the drained node are now Pending or rescheduled
kubectl get pods -A -o wide | grep -v Running

Diagnose It

bash
# Check node conditions
kubectl describe node $NODE | grep -A10 "Conditions:"

# Check for real NotReady scenarios
az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
  --query "agentPoolProfiles[].{name:name,count:count,powerState:powerState.code}" -o table

# Check VMSS instance health in Azure
az vmss list-instances --resource-group MC_${RESOURCE_GROUP}_${CLUSTER_NAME}_${LOCATION} \
  --name $(az vmss list --resource-group MC_${RESOURCE_GROUP}_${CLUSTER_NAME}_${LOCATION} -o tsv --query "[0].name") \
  --query "[].{id:instanceId,state:provisioningState}" -o table

Fix It

bash
# Uncordon the node (allow scheduling again)
kubectl uncordon $NODE

# Verify it's Ready
kubectl get nodes
# NAME                              STATUS   ROLES   AGE
# aks-nodepool1-xxxxx-vmss0000      Ready    agent   1h
# aks-nodepool1-xxxxx-vmss0001      Ready    agent   1h

# If a node is truly dead, let AKS replace it:
# az aks nodepool scale --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME \
#   --name nodepool1 --node-count 2

💥 Scenario 4 — Service Not Reachable (DNS Failure)

The app deploys fine but can't connect to other services by name.

Break It

bash
# Scale CoreDNS to 0 replicas (kills internal DNS). AKS runs a
# coredns-autoscaler that may restore the replica count within a minute
# or two, so move on to the Observe step promptly.
kubectl scale deployment coredns -n kube-system --replicas=0

Observe It

bash
# Try DNS resolution from inside a pod
kubectl run dns-test --namespace break-fix --image=busybox --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local
# ;; connection timed out; no servers could be reached

# Your app logs will show connection errors like:
# Error: getaddrinfo ENOTFOUND redis

Diagnose It

bash
# Check if CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# No resources found in kube-system namespace.  <-- That's the problem!

# Check CoreDNS deployment
kubectl get deployment coredns -n kube-system
# NAME      READY   UP-TO-DATE   AVAILABLE   AGE
# coredns   0/0     0            0           1h

Fix It

bash
# Restore CoreDNS
kubectl scale deployment coredns -n kube-system --replicas=2

# Verify DNS works again
kubectl run dns-test2 --namespace break-fix --image=busybox --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local
# Name:    kubernetes.default.svc.cluster.local
# Address: 10.0.0.1

💥 Scenario 5 — RBAC Denied

A user or service account gets Forbidden errors when trying to access resources.

Break It

bash
# Create a service account with NO permissions
kubectl create serviceaccount restricted-sa -n break-fix

# Try to list pods using that service account
kubectl auth can-i list pods -n break-fix --as=system:serviceaccount:break-fix:restricted-sa
# no

Diagnose It

bash
# Check what the SA can do
kubectl auth can-i --list --as=system:serviceaccount:break-fix:restricted-sa -n break-fix
# Resources   Verbs
# --------    -----
# (nothing meaningful listed)

# Check existing role bindings
kubectl get rolebindings -n break-fix
kubectl get clusterrolebindings | grep break-fix

Fix It

bash
# Create a Role that allows reading pods
kubectl create role pod-reader -n break-fix \
  --verb=get,list,watch \
  --resource=pods

# Bind it to the service account
kubectl create rolebinding pod-reader-binding -n break-fix \
  --role=pod-reader \
  --serviceaccount=break-fix:restricted-sa

# Verify
kubectl auth can-i list pods -n break-fix --as=system:serviceaccount:break-fix:restricted-sa
# yes
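The two imperative kubectl create commands above correspond to this declarative manifest, if you prefer to keep RBAC in source control (a sketch using the same names):

```shell
# Declarative equivalent of `kubectl create role` + `kubectl create rolebinding`.
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: break-fix
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: break-fix
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
subjects:
  - kind: ServiceAccount
    name: restricted-sa
    namespace: break-fix
EOF
```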

💥 Scenario 6 — Pending Pods (Insufficient Resources)

Break It

bash
# Request more CPU than any node has available
kubectl run greedy-pod --namespace break-fix \
  --image=nginx \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"greedy","image":"nginx","resources":{"requests":{"cpu":"64","memory":"128Gi"}}}]}}'
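If the inline --overrides JSON is hard to read, the same over-request can be expressed as a plain manifest (an equivalent sketch):

```shell
# Same greedy pod, written as a manifest: requests 64 CPUs and 128Gi of
# memory, far more than any lab node can allocate.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: greedy-pod
  namespace: break-fix
spec:
  containers:
    - name: greedy
      image: nginx
      resources:
        requests:
          cpu: "64"
          memory: 128Gi
EOF
```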

Observe & Diagnose

bash
kubectl get pods -n break-fix
# NAME         READY   STATUS    AGE
# greedy-pod   0/1     Pending   1m

kubectl describe pod greedy-pod -n break-fix | grep -A3 "Events:"
# Events:
#   Warning  FailedScheduling  default-scheduler
#   0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.

# Check actual node capacity
kubectl describe nodes | grep -A5 "Allocatable:"

Fix It

bash
# Option 1: Reduce the resource request
kubectl delete pod greedy-pod -n break-fix
kubectl run greedy-pod --namespace break-fix --image=nginx \
  --overrides='{"apiVersion":"v1","spec":{"containers":[{"name":"greedy","image":"nginx","resources":{"requests":{"cpu":"100m","memory":"128Mi"}}}]}}'

# Option 2: Scale up the node pool
az aks nodepool scale \
  --resource-group $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME \
  --name nodepool1 \
  --node-count 3

🧹 Cleanup

bash
kubectl delete namespace break-fix
# Make sure CoreDNS is scaled back up if you broke it:
kubectl scale deployment coredns -n kube-system --replicas=2

📋 Break & Fix Quick Reference

Symptom          | First Command                    | Likely Cause
-----------------|----------------------------------|------------------------------------------------
ImagePullBackOff | kubectl describe pod             | Wrong image, ACR auth, or image doesn't exist
CrashLoopBackOff | kubectl logs --previous          | App crash, missing env var, OOMKilled
Pending          | kubectl describe pod (Events)    | Insufficient CPU/memory, node affinity mismatch
Node NotReady    | kubectl describe node            | VM issue, kubelet crash, disk pressure
DNS failures     | kubectl get pods -n kube-system  | CoreDNS down or misconfigured
403 Forbidden    | kubectl auth can-i --list        | Missing RBAC role/binding
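If you want the table at your fingertips, it can be encoded as a small shell helper (our own convenience function; the symptom keys are informal shorthand):

```shell
# first_command <symptom> -- print the first diagnostic command to run.
# Symptom names match the kubectl STATUS column where one exists; "dns"
# and "forbidden" are our own shorthand keys.
first_command() {
  case "$1" in
    ImagePullBackOff) echo "kubectl describe pod <pod> -n <ns>" ;;
    CrashLoopBackOff) echo "kubectl logs <pod> -n <ns> --previous" ;;
    Pending)          echo "kubectl describe pod <pod> -n <ns>   # read the Events section" ;;
    NotReady)         echo "kubectl describe node <node>" ;;
    dns)              echo "kubectl get pods -n kube-system -l k8s-app=kube-dns" ;;
    forbidden)        echo "kubectl auth can-i --list --as=<user-or-serviceaccount>" ;;
    *)                echo "kubectl describe pod <pod> -n <ns>   # safe starting point" ;;
  esac
}

first_command CrashLoopBackOff   # kubectl logs <pod> -n <ns> --previous
```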