Break & Fix — AKS Failure Scenarios
Intentionally break your AKS cluster in controlled ways, then diagnose and fix each issue. This is the fastest way to build real troubleshooting muscle.
🎯 How This Lab Works
Each scenario follows the same pattern:
- Break it — We give you a command that introduces a specific failure
- Observe it — See what the symptoms look like (the same symptoms you'd see in production)
- Diagnose it — Use kubectl, az CLI, and logs to find the root cause
- Fix it — Apply the correct fix and verify recovery
Run these exercises on a disposable lab cluster, not a production environment. Some scenarios involve deleting resources, corrupting configs, and draining nodes.
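Before running anything, a cheap guard against pointing kubectl at the wrong cluster is worth having. A minimal sketch, assuming your production contexts contain "prod" in their names (the pattern and function name are made up; adapt them to your own naming convention):

```shell
# Guard: refuse to proceed if the kubectl context name looks
# production-like. The "prod" substring check is an assumption.
is_lab_context() {
  case "$1" in
    *prod*) return 1 ;;  # looks like production: refuse
    *)      return 0 ;;  # anything else: allow
  esac
}

# Usage (requires a live kubeconfig):
#   is_lab_context "$(kubectl config current-context)" || exit 1
```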
💥 Scenario 1 — ImagePullBackOff
The single most common AKS deployment failure: the pod can't pull the container image.
Break It
```shell
# Deploy a pod with a typo in the image name
kubectl create namespace break-fix
kubectl run broken-app --namespace break-fix \
  --image=myacr.azurecr.io/does-not-exist:v1
```
Observe It
```shell
kubectl get pods -n break-fix -w
# NAME         READY   STATUS             RESTARTS   AGE
# broken-app   0/1     ImagePullBackOff   0          30s

kubectl describe pod broken-app -n break-fix | grep -A5 "Events:"
# Events:
#   Warning  Failed  kubelet  Failed to pull image "myacr.azurecr.io/does-not-exist:v1":
#     rpc error: code = NotFound desc = failed to pull and unpack image:
#     failed to resolve reference: myacr.azurecr.io/does-not-exist:v1: not found
```
Diagnose It
Three common causes of ImagePullBackOff on AKS:
| Cause | How to Check | Fix |
|---|---|---|
| Wrong image name/tag | az acr repository list --name myacr | Fix the image reference in the deployment |
| AKS can't auth to ACR | az aks check-acr --resource-group rg --name aks --acr myacr.azurecr.io | az aks update --attach-acr myacr |
| ACR is in a different subscription/tenant | Check role assignments on ACR | Create a Kubernetes pull secret instead |
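For the cross-subscription/tenant case, the pull secret is created with `kubectl create secret docker-registry`. Below is a sketch that composes the command for review before running it; `acr-pull`, `myacr`, and `$ACR_PASSWORD` are placeholders, and the real password would come from `az acr credential show` or a service principal:

```shell
# Compose the pull-secret command (placeholders: acr-pull, myacr).
# \$ACR_PASSWORD is left unexpanded so the command can be reviewed
# before real credentials are substituted in.
CMD="kubectl create secret docker-registry acr-pull \
  --docker-server=myacr.azurecr.io \
  --docker-username=myacr \
  --docker-password=\$ACR_PASSWORD \
  --namespace break-fix"
echo "$CMD"

# The pod spec then references the secret:
#   spec:
#     imagePullSecrets:
#       - name: acr-pull
```

Review the printed command, substitute real credentials, and run it against the lab cluster.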
Fix It
```shell
# Fix 1: Correct the image reference
kubectl set image pod/broken-app broken-app=myacr.azurecr.io/real-app:v1 -n break-fix

# Fix 2: If it's an auth issue, re-attach ACR
az aks update \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --attach-acr $ACR_NAME

# Verify fix
kubectl get pods -n break-fix
# NAME         READY   STATUS    RESTARTS   AGE
# broken-app   1/1     Running   0          10s
```
💥 Scenario 2 — CrashLoopBackOff
The container starts but immediately crashes, over and over.
Break It
```shell
# Deploy a pod with a bad command that exits immediately
kubectl run crash-app --namespace break-fix \
  --image=alpine \
  --command -- sh -c "echo 'starting...'; exit 1"
```
Observe It
```shell
kubectl get pods -n break-fix -w
# NAME        READY   STATUS             RESTARTS      AGE
# crash-app   0/1     CrashLoopBackOff   3 (30s ago)   2m

# Check previous container logs
kubectl logs crash-app -n break-fix --previous
# starting...

# Describe shows the exit code
kubectl describe pod crash-app -n break-fix | grep -A3 "Last State"
# Last State:  Terminated
#   Reason:    Error
#   Exit Code: 1
```
Key Diagnostic Commands
| Exit Code | Meaning | Common Cause |
|---|---|---|
| 0 | Success (but container should keep running) | Missing CMD or process exits cleanly |
| 1 | Application error | Missing env var, bad config, unhandled exception |
| 137 | OOMKilled | Container exceeded memory limits |
| 139 | Segfault | Native code crash, corrupted binary |
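Exit codes above 128 follow the 128 + signal number convention: 137 = 128 + 9 (SIGKILL, which is what the OOM killer delivers) and 139 = 128 + 11 (SIGSEGV). A small decoder sketch mirroring the table (the function name is made up):

```shell
# Decode a container exit code into the categories from the table above.
decode_exit() {
  case "$1" in
    0)   echo "clean exit" ;;
    137) echo "OOMKilled/SIGKILL (128+9)" ;;
    139) echo "segfault/SIGSEGV (128+11)" ;;
    *)   if [ "$1" -gt 128 ]; then
           # Any code > 128 means the process died from a signal
           echo "killed by signal $(( $1 - 128 ))"
         else
           echo "application error ($1)"
         fi ;;
  esac
}
```

Usage: `decode_exit 143` reports "killed by signal 15" (SIGTERM), which typically means the container was stopped during pod termination rather than crashing on its own.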
Fix It
```shell
# Delete the broken pod and deploy correctly
kubectl delete pod crash-app -n break-fix
kubectl run crash-app --namespace break-fix \
  --image=alpine \
  --command -- sh -c "echo 'healthy'; sleep infinity"

# If it was OOMKilled (exit 137), delete and recreate the pod with a
# higher memory limit (e.g. resources.limits.memory: 512Mi); pod
# resource fields can't be edited in place.
```
💥 Scenario 3 — Node NotReady
A worker node becomes NotReady — pods get evicted, scheduling fails.
Break It
```shell
# Cordon and drain a node to simulate it going NotReady
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl cordon $NODE
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data --force
```
Observe It
```shell
kubectl get nodes
# NAME                           STATUS                     ROLES   AGE
# aks-nodepool1-xxxxx-vmss0000   Ready,SchedulingDisabled   agent   1h
# aks-nodepool1-xxxxx-vmss0001   Ready                      agent   1h

# Pods that were on the drained node are now Pending or rescheduled
kubectl get pods -A -o wide | grep -v Running
```
Diagnose It
```shell
# Check node conditions
kubectl describe node $NODE | grep -A10 "Conditions:"

# Check for real NotReady scenarios
az aks show --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME \
  --query "agentPoolProfiles[].{name:name,count:count,powerState:powerState}" -o table

# Check VMSS instance health in Azure
az vmss list-instances --resource-group MC_${RESOURCE_GROUP}_${CLUSTER_NAME}_${LOCATION} \
  --name $(az vmss list --resource-group MC_${RESOURCE_GROUP}_${CLUSTER_NAME}_${LOCATION} -o tsv --query "[0].name") \
  --query "[].{id:instanceId,state:provisioningState}" -o table
```
Fix It
```shell
# Uncordon the node (allow scheduling again)
kubectl uncordon $NODE

# Verify it's Ready
kubectl get nodes
# NAME                           STATUS   ROLES   AGE
# aks-nodepool1-xxxxx-vmss0000   Ready    agent   1h
# aks-nodepool1-xxxxx-vmss0001   Ready    agent   1h

# If a node is truly dead, let AKS replace it:
# az aks nodepool scale --resource-group $RESOURCE_GROUP --cluster-name $CLUSTER_NAME \
#   --name nodepool1 --node-count 2
```
💥 Scenario 4 — Service Not Reachable (DNS Failure)
The app deploys fine but can't connect to other services by name.
Break It
```shell
# Scale CoreDNS to 0 replicas (kills internal DNS)
kubectl scale deployment coredns -n kube-system --replicas=0
```
Observe It
```shell
# Try DNS resolution from inside a pod
kubectl run dns-test --namespace break-fix --image=busybox --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local
# ;; connection timed out; no servers could be reached

# Your app logs will show connection errors like:
# Error: getaddrinfo ENOTFOUND redis
```
Diagnose It
```shell
# Check if CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
# No resources found in kube-system namespace.  <-- That's the problem!

# Check CoreDNS deployment
kubectl get deployment coredns -n kube-system
# NAME      READY   UP-TO-DATE   AVAILABLE   AGE
# coredns   0/0     0            0           1h
```
Fix It
```shell
# Restore CoreDNS
kubectl scale deployment coredns -n kube-system --replicas=2

# Verify DNS works again
kubectl run dns-test2 --namespace break-fix --image=busybox --rm -it --restart=Never \
  -- nslookup kubernetes.default.svc.cluster.local
# Name:    kubernetes.default.svc.cluster.local
# Address: 10.0.0.1
```
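The names resolved in this scenario follow the Kubernetes service DNS convention `<service>.<namespace>.svc.<cluster-domain>`, with `cluster.local` as the default cluster domain. A tiny helper for composing such names (hypothetical, for illustration):

```shell
# Build the fully qualified DNS name of a Service.
# $1 = service name, $2 = namespace, $3 = cluster domain
# (defaults to cluster.local if omitted).
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}
```

For example, `svc_fqdn kubernetes default` yields `kubernetes.default.svc.cluster.local`; the short name `redis` in the earlier error only resolves from pods in the same namespace as the `redis` Service.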
💥 Scenario 5 — RBAC Denied
A user or service account gets Forbidden errors when trying to access resources.
Break It
```shell
# Create a service account with NO permissions
kubectl create serviceaccount restricted-sa -n break-fix

# Try to list pods using that service account
kubectl auth can-i list pods -n break-fix --as=system:serviceaccount:break-fix:restricted-sa
# no
```
Diagnose It
```shell
# Check what the SA can do
kubectl auth can-i --list --as=system:serviceaccount:break-fix:restricted-sa -n break-fix
# Resources   Verbs
# --------    -----
# (nothing meaningful listed)

# Check existing role bindings
kubectl get rolebindings -n break-fix
kubectl get clusterrolebindings | grep break-fix
```
Fix It
```shell
# Create a Role that allows reading pods
kubectl create role pod-reader -n break-fix \
  --verb=get,list,watch \
  --resource=pods

# Bind it to the service account
kubectl create rolebinding pod-reader-binding -n break-fix \
  --role=pod-reader \
  --serviceaccount=break-fix:restricted-sa

# Verify
kubectl auth can-i list pods -n break-fix --as=system:serviceaccount:break-fix:restricted-sa
# yes
```
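The long `--as` strings in this scenario follow a fixed format: a service account is impersonated as `system:serviceaccount:<namespace>:<name>`. A helper to avoid typos in that subject string (the function name is an invention for this sketch):

```shell
# Build the impersonation subject for a service account,
# e.g. kubectl auth can-i list pods --as="$(sa_subject break-fix restricted-sa)"
sa_subject() {
  echo "system:serviceaccount:$1:$2"
}
```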
💥 Scenario 6 — Pending Pods (Insufficient Resources)
Break It
```shell
# Request more CPU than any node has available
kubectl run greedy-pod --namespace break-fix \
  --image=nginx \
  --overrides='{"spec":{"containers":[{"name":"greedy","image":"nginx","resources":{"requests":{"cpu":"64","memory":"128Gi"}}}]}}'
```
Observe & Diagnose
```shell
kubectl get pods -n break-fix
# NAME         READY   STATUS    AGE
# greedy-pod   0/1     Pending   1m

kubectl describe pod greedy-pod -n break-fix | grep -A3 "Events:"
# Events:
#   Warning  FailedScheduling  default-scheduler
#   0/2 nodes are available: 2 Insufficient cpu, 2 Insufficient memory.

# Check actual node capacity
kubectl describe nodes | grep -A5 "Allocatable:"
```
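The scheduler rejects the pod because `"cpu":"64"` asks for 64 whole cores, i.e. 64000 millicores; Kubernetes CPU quantities are either core counts or millicore values with an `m` suffix. A converter sketch for comparing requests against node allocatable (integer quantities only; fractional forms like `0.5` are out of scope here):

```shell
# Convert a Kubernetes CPU quantity to millicores.
# "100m" -> 100, "64" -> 64000.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;          # already millicores: strip suffix
    *)  echo $(( $1 * 1000 )) ;;  # whole cores: multiply by 1000
  esac
}
```

Comparing `cpu_to_millicores 64` against a typical lab node's allocatable (a few thousand millicores) makes the "Insufficient cpu" event unsurprising.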
Fix It
```shell
# Option 1: Reduce the resource request
kubectl delete pod greedy-pod -n break-fix
kubectl run greedy-pod --namespace break-fix --image=nginx \
  --overrides='{"spec":{"containers":[{"name":"greedy","image":"nginx","resources":{"requests":{"cpu":"100m","memory":"128Mi"}}}]}}'

# Option 2: Scale up the node pool
az aks nodepool scale \
  --resource-group $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME \
  --name nodepool1 \
  --node-count 3
```
🧹 Cleanup
```shell
kubectl delete namespace break-fix

# Make sure CoreDNS is scaled back up if you broke it:
kubectl scale deployment coredns -n kube-system --replicas=2
```
📋 Break & Fix Quick Reference
| Symptom | First Command | Likely Cause |
|---|---|---|
| ImagePullBackOff | kubectl describe pod | Wrong image, ACR auth, or image doesn't exist |
| CrashLoopBackOff | kubectl logs --previous | App crash, missing env var, OOMKilled |
| Pending | kubectl describe pod (Events) | Insufficient CPU/memory, node affinity mismatch |
| Node NotReady | kubectl describe node | VM issue, kubelet crash, disk pressure |
| DNS failures | kubectl get pods -n kube-system | CoreDNS down or misconfigured |
| 403 Forbidden | kubectl auth can-i --list | Missing RBAC role/binding |