Hands-on Lesson 12 of 14

Break & Fix Lab

Intentionally break Kubernetes deployments, then diagnose and fix them. The best way to learn debugging is by breaking things on purpose.

🧒 Simple Explanation (ELI5)

You know how mechanics learn by taking cars apart and putting them back together? That's what we're doing. We'll break things on purpose (wrong image, bad config, mismatched labels) and then fix them. After this lab, you'll recognize the most common K8s failure states on sight and know which command to run first.

💡 How This Lab Works

Each challenge gives you a broken YAML or a break command. Your job: figure out what's wrong, diagnose with kubectl, and fix it. Solutions are provided — but try to solve it yourself first!
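
The "diagnose with kubectl" step tends to follow the same order in every challenge. As a rough sketch, it could be wrapped in a small shell function. The name `triage` is made up for this lab and is not a kubectl feature; the kubectl subcommands and flags inside it are standard.

```bash
# triage NAMESPACE LABEL_SELECTOR
# Hypothetical helper: runs the usual first-look commands in order.
triage() {
  local ns="$1" selector="$2"
  # 1. Overall status and restart counts
  kubectl get pods -n "$ns" -l "$selector" -o wide
  # 2. Container states and pod events
  kubectl describe pod -n "$ns" -l "$selector"
  # 3. Recent container output (empty if the container never started)
  kubectl logs -n "$ns" -l "$selector" --tail=20
  # 4. Namespace events, oldest first
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```

Once the namespace exists, a call would look like `triage break-fix-lab app=challenge-1`. For a container that keeps crashing, `kubectl logs <pod> --previous` shows the output of the last terminated instance.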

Setup: Create the Lab Namespace

bash
kubectl create namespace break-fix-lab

🔴 Challenge 1: Wrong Image Name

Scenario: A deployment was created but pods never become Ready.

yaml
# challenge-1.yaml — Deploy this broken manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: challenge-1
  namespace: break-fix-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: challenge-1
  template:
    metadata:
      labels:
        app: challenge-1
    spec:
      containers:
        - name: web
          image: ngnix:latest    # ← Bug is here
          ports:
            - containerPort: 80
bash
kubectl apply -f challenge-1.yaml
🤔 Your Task

Pods are stuck. Diagnose why and fix it without deleting the deployment.

Solution
bash
# Step 1: Check pod status
kubectl get pods -n break-fix-lab
# STATUS: ImagePullBackOff or ErrImagePull

# Step 2: Describe the pod
kubectl describe pod -l app=challenge-1 -n break-fix-lab
# Events show: Failed to pull image "ngnix:latest" (exact wording depends on the runtime, e.g. "repository does not exist" or "pull access denied")

# Step 3: Spot the typo — "ngnix" should be "nginx"

# Step 4: Fix it
kubectl set image deployment/challenge-1 web=nginx:latest -n break-fix-lab

# Step 5: Verify
kubectl get pods -n break-fix-lab -w
# Pods transition to Running

Root cause: Image name typo — ngnix instead of nginx.
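
The STATUS column in Step 1 is taken from each container's waiting reason. If you prefer to read that field directly, a JSONPath query is one option. This is a minimal sketch: `waiting_reasons` is a hypothetical helper name, and it only looks at the first container of each pod.

```bash
# waiting_reasons NAMESPACE LABEL_SELECTOR
# Hypothetical helper: prints "<pod><TAB><reason>" per pod.
# The reason is empty when the first container is not in a waiting state.
waiting_reasons() {
  local ns="$1" selector="$2"
  kubectl get pods -n "$ns" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].state.waiting.reason}{"\n"}{end}'
}
```

Example call: `waiting_reasons break-fix-lab app=challenge-1`. Before the fix you would expect ErrImagePull or ImagePullBackOff next to each pod name, and an empty reason once the pods are running.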

🔴 Challenge 2: Mismatched Selector Labels

Scenario: Service is created but has no endpoints. The app is unreachable.

yaml
# challenge-2.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: challenge-2
  namespace: break-fix-lab
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: web
          image: nginx:1.25-alpine
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: challenge-2-svc
  namespace: break-fix-lab
spec:
  selector:
    app: my-app         # ← Bug: "my-app" vs "myapp"
  ports:
    - port: 80
      targetPort: 80
bash
kubectl apply -f challenge-2.yaml
🤔 Your Task

Pods are running but the service can't reach them. Find and fix the issue.

Solution
bash
# Step 1: Check endpoints
kubectl get endpoints challenge-2-svc -n break-fix-lab
# ENDPOINTS: <none>

# Step 2: Compare labels
kubectl get pods -n break-fix-lab --show-labels | grep challenge-2
# Labels: app=myapp

kubectl get svc challenge-2-svc -n break-fix-lab -o yaml | grep -A2 selector
# selector: app: my-app    ← MISMATCH! "my-app" ≠ "myapp"

# Step 3: Fix the service selector
kubectl patch svc challenge-2-svc -n break-fix-lab \
  -p '{"spec":{"selector":{"app":"myapp"}}}'

# Step 4: Verify
kubectl get endpoints challenge-2-svc -n break-fix-lab
# Should now show pod IPs

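Root cause: The Service selector app=my-app doesn't match the pod label app=myapp. The selector contains a hyphen that the pod label lacks, so the Service selects no pods and gets no endpoints.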
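
To make the matching rule concrete: a Service selects a pod only if every key=value pair in its selector appears on the pod, compared as exact strings. The function below is a simplified illustration of that rule in plain bash, not how Kubernetes implements it. The name `selector_matches` and the `pod-template-hash` value are made up for the example, and a non-empty selector is assumed.

```bash
# selector_matches SELECTOR LABELS
# Both arguments are comma-separated key=value lists, e.g. "app=myapp,tier=web".
# Returns 0 only if every selector pair appears verbatim among the labels.
selector_matches() {
  local selector="$1" labels="$2" pair
  local IFS=','
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;        # this pair is present on the pod
      *) return 1 ;;         # one missing pair is enough to exclude the pod
    esac
  done
  return 0
}

selector_matches "app=my-app" "app=myapp,pod-template-hash=abc123" || echo "no match"
selector_matches "app=myapp"  "app=myapp,pod-template-hash=abc123" && echo "match"
```

The first call prints "no match" because "my-app" and "myapp" are different strings; the second prints "match". Extra labels on the pod do not matter, only the pairs named in the selector.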

🔴 Challenge 3: CrashLoopBackOff

Scenario: Pod starts but immediately crashes, restarting in a loop.

yaml
# challenge-3.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: challenge-3
  namespace: break-fix-lab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: challenge-3
  template:
    metadata:
      labels:
        app: challenge-3
    spec:
      containers:
        - name: app
          image: busybox:1.36
          command: ["cat", "/config/app.conf"]   # ← File doesn't exist
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          emptyDir: {}    # ← Empty — no app.conf inside
Solution
bash
# Step 1: Check status
kubectl get pods -n break-fix-lab -l app=challenge-3
# STATUS: CrashLoopBackOff, RESTARTS increasing

# Step 2: Check logs
kubectl logs -l app=challenge-3 -n break-fix-lab
# "cat: can't open '/config/app.conf': No such file or directory"

# Step 3: The volume is emptyDir with no content. Supply the file via ConfigMap.

# Fix: Create a ConfigMap with the expected file
kubectl create configmap challenge-3-config \
  --from-literal=app.conf="server.port=8080" \
  -n break-fix-lab

# Patch the deployment to use ConfigMap volume
kubectl patch deployment challenge-3 -n break-fix-lab --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/volumes/0","value":{"name":"config","configMap":{"name":"challenge-3-config"}}},{"op":"replace","path":"/spec/template/spec/containers/0/command","value":["sh","-c","cat /config/app.conf ; sleep 3600"]}]'

# Verify
kubectl get pods -n break-fix-lab -l app=challenge-3
# STATUS: Running

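Root cause: The container expects a file at /config/app.conf, but the volume is an empty directory, so cat exits with an error. The fix also appends sleep 3600 because cat exits as soon as it has printed the file, and a Deployment restarts any container whose main process ends.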
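
The exit code shown under "Last State: Terminated" in `kubectl describe pod` helps tell these cases apart. The sketch below gives a rough reading of it using the common shell convention that codes above 128 mean "killed by signal (code - 128)". The function name `explain_exit_code` is hypothetical.

```bash
# explain_exit_code CODE
# Rough interpretation of a container exit code.
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly; the main process finished, so the container is restarted"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error (exit code $code)"
  fi
}

explain_exit_code 1     # the kind of code a failing cat returns
explain_exit_code 137   # 137 - 128 = 9, i.e. SIGKILL
```

Pods managed by a Deployment always use restartPolicy: Always, which is why even a clean exit leads to a restart loop.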

🔴 Challenge 4: Missing Secret Reference

Scenario: Pod won't start — stuck in CreateContainerConfigError.

yaml
# challenge-4.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: challenge-4
  namespace: break-fix-lab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: challenge-4
  template:
    metadata:
      labels:
        app: challenge-4
    spec:
      containers:
        - name: app
          image: nginx:1.25-alpine
          env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials     # ← This secret doesn't exist
                  key: password
Solution
bash
# Step 1: Check status
kubectl get pods -n break-fix-lab -l app=challenge-4
# STATUS: CreateContainerConfigError

# Step 2: Describe the pod
kubectl describe pod -l app=challenge-4 -n break-fix-lab
# Warning: "secret 'db-credentials' not found"

# Step 3: Create the missing secret
kubectl create secret generic db-credentials \
  --from-literal=password='MySecurePass123' \
  -n break-fix-lab

# Step 4 (optional): the kubelet retries on its own once the secret exists;
# a rollout restart just gets you a fresh pod sooner
kubectl rollout restart deployment/challenge-4 -n break-fix-lab

# Verify
kubectl get pods -n break-fix-lab -l app=challenge-4
# STATUS: Running

Root cause: Pod references a secret that doesn't exist. K8s can't create the container without it.
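
A quick pre-flight check can catch this before the deployment is applied. The sketch below assumes a hypothetical helper named `secret_has_key`. It treats an empty value as missing and does not handle key names containing dots, which would need escaping in JSONPath.

```bash
# secret_has_key NAMESPACE SECRET KEY
# Hypothetical helper: succeeds only if the Secret exists and the key is non-empty.
secret_has_key() {
  local ns="$1" secret="$2" key="$3" value
  value="$(kubectl get secret "$secret" -n "$ns" \
    -o jsonpath="{.data.$key}" 2>/dev/null)" || return 1
  [ -n "$value" ]
}
```

Example: `secret_has_key break-fix-lab db-credentials password || echo "create the secret first"`.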

🔴 Challenge 5: Resource Limit Too Low

Scenario: Pod gets OOMKilled repeatedly.

yaml
# challenge-5.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: challenge-5
  namespace: break-fix-lab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: challenge-5
  template:
    metadata:
      labels:
        app: challenge-5
    spec:
      containers:
        - name: stress
          image: progrium/stress
          command: ["stress", "--vm", "1", "--vm-bytes", "128M"]
          resources:
            limits:
              memory: 64Mi    # ← Too low for 128M allocation
Solution
bash
# Step 1: Check status
kubectl get pods -n break-fix-lab -l app=challenge-5
# STATUS: OOMKilled, CrashLoopBackOff

# Step 2: Describe the pod and check the container's last state
kubectl describe pod -l app=challenge-5 -n break-fix-lab
# Last State: Terminated, Reason: OOMKilled

# Step 3: The process tries to allocate 128M but limit is 64Mi

# Fix: Increase memory limit
kubectl patch deployment challenge-5 -n break-fix-lab --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/limits/memory","value":"256Mi"}]'

# Verify
kubectl get pods -n break-fix-lab -l app=challenge-5 -w

Root cause: Memory limit (64Mi) is lower than what the process allocates (128M). When the container exceeds its limit, the kernel's OOM killer terminates it and Kubernetes reports the container as OOMKilled.
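
The arithmetic behind the comparison is worth spelling out, because Kubernetes distinguishes binary suffixes (Ki, Mi, Gi) from decimal ones (k, M, G). The converter below is a sketch only: the function name `mem_to_bytes` is made up, fractional and exponent forms are not handled, and it assumes stress reads "128M" as 128 * 1024 * 1024 bytes.

```bash
# mem_to_bytes QUANTITY
# Converts a Kubernetes memory quantity to bytes.
# Ki/Mi/Gi are powers of 1024; k/M/G are powers of 1000.
mem_to_bytes() {
  local q="$1"
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *k)  echo $(( ${q%k} * 1000 )) ;;
    *M)  echo $(( ${q%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${q%G} * 1000 * 1000 * 1000 )) ;;
    *)   echo "$q" ;;   # plain integer: already bytes
  esac
}

limit=$(mem_to_bytes 64Mi)
# Assumption: the stress allocation of "128M" equals 128Mi in Kubernetes terms.
wanted=$(mem_to_bytes 128Mi)
echo "limit=$limit wanted=$wanted"
[ "$wanted" -gt "$limit" ] && echo "allocation exceeds the limit"
```

The allocation is twice the limit, so the container cannot survive. The 256Mi used in the fix leaves headroom above the 128 MiB the process allocates.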

🔴 Challenge 6: Liveness Probe Misconfigured

Scenario: Pods keep restarting shortly after they start, even though the app is healthy.

yaml
# challenge-6.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: challenge-6
  namespace: break-fix-lab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: challenge-6
  template:
    metadata:
      labels:
        app: challenge-6
    spec:
      containers:
        - name: web
          image: nginx:1.25-alpine
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /healthz    # ← nginx returns 404 for /healthz
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 2
Solution
bash
# Step 1: Watch pods — RESTARTS keeps incrementing
kubectl get pods -n break-fix-lab -l app=challenge-6 -w

# Step 2: Describe → Events
kubectl describe pod -l app=challenge-6 -n break-fix-lab
# "Liveness probe failed: HTTP probe failed with statuscode: 404"

# Step 3: nginx doesn't serve /healthz — it serves / which returns 200

# Fix: Change probe path to /
kubectl patch deployment challenge-6 -n break-fix-lab --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/livenessProbe/httpGet/path","value":"/"}]'

# Verify — RESTARTS should stop incrementing
kubectl get pods -n break-fix-lab -l app=challenge-6 -w

Root cause: Liveness probe checks /healthz but nginx's default config doesn't have that endpoint. Returns 404 → probe fails → K8s restarts the container.
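
Before changing a probe, it can help to confirm which paths the server actually answers. The sketch below requests a path from inside the container as a stand-in for the kubelet's httpGet probe, which really runs from the node against the pod IP. `probe_path_ok` is a hypothetical helper, and it assumes the image ships wget, as the alpine-based nginx image does via BusyBox.

```bash
# probe_path_ok NAMESPACE DEPLOYMENT PATH [PORT]
# Hypothetical helper: succeeds only if the path returns a non-error response.
probe_path_ok() {
  local ns="$1" deploy="$2" path="$3" port="${4:-80}"
  kubectl exec -n "$ns" "deploy/$deploy" -- \
    wget -q -O /dev/null "http://127.0.0.1:${port}${path}"
}
```

Example: `probe_path_ok break-fix-lab challenge-6 /healthz || echo "this path would fail the probe"`. While the container is still being restarted by the broken probe, the exec itself may fail, so retry if the result looks inconsistent.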

🧹 Cleanup

bash
kubectl delete namespace break-fix-lab

📝 Debugging Cheat Sheet

Status | First Command | What to Look For
ImagePullBackOff | kubectl describe pod | Wrong image name/tag, missing registry credentials
CrashLoopBackOff | kubectl logs | App error, missing file/env, exit code
CreateContainerConfigError | kubectl describe pod | Missing ConfigMap or Secret
OOMKilled | kubectl describe pod | Memory limit too low for workload
Pending | kubectl describe pod | Insufficient CPU/memory, no matching nodeSelector
Running (restarts) | kubectl describe pod | Liveness probe failing
Service no endpoints | kubectl get endpoints | Selector labels don't match pod labels
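
To find which pods the cheat sheet applies to, the output of `kubectl get pods` can be filtered for anything that is not Running or that has restarted. This is a small sketch that assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE); the function name `needs_attention` is made up.

```bash
# needs_attention: reads `kubectl get pods` output on stdin and prints
# name, status and restart count for pods that are not Running or that
# have restarted at least once.
needs_attention() {
  awk 'NR > 1 && ($3 != "Running" || $4 > 0) { print $1, $3, $4 }'
}
```

Usage: `kubectl get pods -n break-fix-lab | needs_attention`. A pod that shows Running with a growing restart count corresponds to the "Running (restarts)" row.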
