Hands-on Lesson 13 of 14

Debugging Scenarios — kubectl & az CLI Diagnostics

A systematic debugging toolkit for AKS. Learn the exact commands, order of operations, and mental models that SREs use to diagnose production issues on Azure Kubernetes Service.

🧠 The Debugging Mindset

Production debugging is not guessing. It's a systematic narrowing process:

  1. Cluster level — Are the nodes healthy? Is the API server responding?
  2. Namespace level — Are pods running? Any recent events?
  3. Pod level — What's the pod status? What do the logs say?
  4. Container level — Can I exec into it? What's the process doing?
  5. Network level — Can pods reach each other? Is DNS working?
  6. Azure level — Is the underlying VM healthy? Any Azure outages?
💡 Golden Rule

Always start from the outside in. Check cluster health before looking at individual pods. A node problem will cause cascading pod failures — fixing the node fixes all the pods.
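The outside-in order can be kept at hand as a small checklist helper. A minimal sketch (pure lookup, no cluster access; the placeholder names in angle brackets are mine, not real resources):

```shell
# Hypothetical triage helper: print the outside-in checklist with the
# first command to run at each level. No cluster access needed.
triage_checklist() {
  local levels=(
    "Cluster|kubectl cluster-info"
    "Namespace|kubectl get events -n <ns> --sort-by=.lastTimestamp"
    "Pod|kubectl describe pod <name> -n <ns>"
    "Container|kubectl exec -it <pod> -n <ns> -- /bin/sh"
    "Network|kubectl get endpoints <svc> -n <ns>"
    "Azure|az aks show -g <rg> -n <aks> --query provisioningState"
  )
  local i=1 entry
  for entry in "${levels[@]}"; do
    # entry is "Level|command"; split on the first pipe
    printf '%d. %-10s %s\n' "$i" "${entry%%|*}" "${entry#*|}"
    i=$((i + 1))
  done
}

triage_checklist
```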

🔧 Level 1 — Cluster Health

Is the API Server Responding?

bash
# Quick health check — if this hangs, the control plane is down
kubectl cluster-info
# Kubernetes control plane is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443
# CoreDNS is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443/api/v1/...

# Check API server health directly (/healthz still works but is deprecated; /readyz is preferred)
kubectl get --raw='/readyz'
# ok

# Check component status (deprecated since v1.19, and of limited value on AKS's managed control plane)
kubectl get componentstatuses

Are Nodes Healthy?

bash
# List all nodes with status
kubectl get nodes -o wide

# Check node conditions — look for MemoryPressure, DiskPressure, PIDPressure
kubectl describe nodes | grep -A15 "Conditions:"

# Quick check: are any nodes NotReady? (--no-headers drops the header row)
kubectl get nodes --no-headers | grep -v " Ready "

# Check node resource usage
kubectl top nodes
# NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# aks-nodepool1-xxxxx-vmss0000      245m         12%    1834Mi          48%
# aks-nodepool1-xxxxx-vmss0001      189m         9%     1562Mi          41%
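The `kubectl top nodes` output is easy to post-process. A sketch that flags nodes above a memory threshold, run here against the sample output above (in practice, pipe real `kubectl top nodes` output in the same way):

```shell
# flag_nodes: print nodes whose MEMORY% column exceeds a threshold.
# Reads `kubectl top nodes` output on stdin; threshold is $1.
flag_nodes() {
  awk -v limit="$1" 'NR > 1 {        # skip the header row
    mem = $5; sub(/%/, "", mem)      # MEMORY% is column 5; strip the % sign
    if (mem + 0 > limit) print $1, mem "%"
  }'
}

# Sample copied from above; in practice: kubectl top nodes | flag_nodes 45
flag_nodes 45 <<'EOF'
NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-xxxxx-vmss0000      245m         12%    1834Mi          48%
aks-nodepool1-xxxxx-vmss0001      189m         9%     1562Mi          41%
EOF
# → aks-nodepool1-xxxxx-vmss0000 48%
```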

Azure-Level Cluster Check

bash
# Check AKS cluster provisioning state
az aks show --resource-group $RG --name $CLUSTER \
  --query "{state:provisioningState,powerState:powerState.code,k8sVersion:kubernetesVersion}" -o table
# State      PowerState  K8sVersion
# Succeeded  Running     1.29.4

# A state other than Succeeded (e.g. Updating, Upgrading, Scaling) means an operation is in flight
az aks show --resource-group $RG --name $CLUSTER \
  --query "provisioningState" -o tsv

# Refresh credentials, then capture a diagnostics bundle
az aks get-credentials --resource-group $RG --name $CLUSTER
az aks kollect --resource-group $RG --name $CLUSTER \
  --storage-account $STORAGE_ACCOUNT  # kollect uploads the bundle to a storage account

🔧 Level 2 — Namespace Overview

bash
# Get everything in a namespace at a glance
kubectl get all -n myapp

# Show recent events (MOST USEFUL COMMAND for debugging)
kubectl get events -n myapp --sort-by='.lastTimestamp' | tail -20

# Find pods NOT in Running state
kubectl get pods -n myapp --field-selector 'status.phase!=Running'

# Get resource quota usage (if quotas are set)
kubectl describe resourcequota -n myapp

Events Are Your Best Friend

kubectl get events --sort-by='.lastTimestamp' is the single most useful debugging command. Events show scheduling failures, image pulls, volume mounts, liveness probe failures, and OOM kills — all in chronological order.
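When the event stream is noisy, filtering to Warning-type events narrows it fast. A sketch that keeps the header plus warnings (the sample event lines below are hypothetical):

```shell
# warnings_only: keep the header plus Warning-type events from
# `kubectl get events` output (TYPE is column 2 in the default layout).
warnings_only() {
  awk 'NR == 1 || $2 == "Warning"'
}

# Hypothetical sample; in practice:
#   kubectl get events -n myapp --sort-by='.lastTimestamp' | warnings_only
warnings_only <<'EOF'
LAST SEEN   TYPE      REASON        OBJECT      MESSAGE
5m          Normal    Pulled        pod/web-1   Successfully pulled image
2m          Warning   BackOff       pod/web-2   Back-off restarting failed container
1m          Warning   FailedMount   pod/web-3   Unable to attach or mount volumes
EOF
```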

🔧 Level 3 — Pod Diagnostics

Pod Status Troubleshooting Matrix

| Status | What It Means | First Debug Command |
| --- | --- | --- |
| Pending | Not yet scheduled to a node | kubectl describe pod <name> → check Events |
| ContainerCreating | Image pulling or volume mounting | kubectl describe pod <name> → check Events |
| ImagePullBackOff | Can't pull the container image | kubectl describe pod <name> → check image name & ACR auth |
| CrashLoopBackOff | Container starts then crashes | kubectl logs <name> --previous |
| Running but not Ready | Readiness probe failing | kubectl describe pod <name> → check probe config |
| Evicted | Node ran out of resources | kubectl describe pod <name> → check reason field |
| Terminating (stuck) | Finalizer or graceful shutdown hang | kubectl get pod <name> -o yaml \| grep finaliz |
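The matrix can be wired into a tiny helper so the first command is always one lookup away. A sketch (the angle-bracket names are placeholders, not real flags):

```shell
# first_debug_cmd: map a pod status from the matrix above to the
# first diagnostic command to run (<name> is a placeholder).
first_debug_cmd() {
  case "$1" in
    CrashLoopBackOff)  echo 'kubectl logs <name> --previous' ;;
    Terminating)       echo 'kubectl get pod <name> -o yaml | grep finaliz' ;;
    ImagePullBackOff)  echo 'kubectl describe pod <name>  # image name & ACR auth' ;;
    *)                 echo 'kubectl describe pod <name>  # read the Events section' ;;
  esac
}

first_debug_cmd CrashLoopBackOff
# → kubectl logs <name> --previous
```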

Deep Pod Inspection

bash
# Full describe — shows events, conditions, volumes, env
kubectl describe pod my-pod -n myapp

# View container logs (current run)
kubectl logs my-pod -n myapp

# View previous container logs (after a crash)
kubectl logs my-pod -n myapp --previous

# Follow logs in real-time
kubectl logs my-pod -n myapp -f

# Logs from a specific container in a multi-container pod
kubectl logs my-pod -n myapp -c sidecar

# View logs from ALL pods with a label
kubectl logs -n myapp -l app=myapp --all-containers --tail=50

Exec Into a Running Container

bash
# Get a shell inside the container
kubectl exec -it my-pod -n myapp -- /bin/sh

# Quick checks from inside the container:
# 1. Check environment variables
env | sort

# 2. Check DNS resolution
nslookup redis-service.myapp.svc.cluster.local

# 3. Check connectivity to another service
wget -qO- http://backend-service:8080/health

# 4. Check filesystem
df -h
ls -la /app/

# 5. Check running processes
ps aux

🔧 Level 4 — Network Debugging

DNS Debugging

bash
# Deploy a debug pod with network tools
kubectl run netdebug --image=nicolaka/netshoot -it --rm --restart=Never -- bash

# From inside the debug pod:
# 1. Test DNS resolution
nslookup kubernetes.default.svc.cluster.local
nslookup myservice.myapp.svc.cluster.local

# 2. Test connectivity to a service
curl -v http://myservice.myapp.svc.cluster.local:80/

# 3. Test connectivity to external endpoints
curl -v https://mcr.microsoft.com

# 4. Check DNS config
cat /etc/resolv.conf
# nameserver 10.0.0.10  (CoreDNS ClusterIP)
# search myapp.svc.cluster.local svc.cluster.local cluster.local

Service & Endpoint Debugging

bash
# Check if service has endpoints (i.e., backing pods exist)
kubectl get endpoints myservice -n myapp
# NAME        ENDPOINTS                          AGE
# myservice   10.244.1.5:3000,10.244.2.3:3000    1h

# If ENDPOINTS is empty, the selector doesn't match any pods
# Compare service selector with pod labels:
kubectl get svc myservice -n myapp -o jsonpath='{.spec.selector}'
# {"app":"myapp"}

kubectl get pods -n myapp -l app=myapp
# (should show matching pods)

# Check Ingress
kubectl get ingress -n myapp
kubectl describe ingress myingress -n myapp
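The empty-endpoints check lends itself to scripting, e.g. as a deploy-time guard. A sketch that succeeds only when the ENDPOINTS column holds at least one ip:port pair (an empty selector match shows `<none>` there):

```shell
# endpoints_ok: exit 0 if the ENDPOINTS column of `kubectl get endpoints`
# output contains an ip:port pair, non-zero if it is <none>/empty.
endpoints_ok() {
  awk 'NR == 2 { exit ($2 ~ /:/) ? 0 : 1 }'
}

# Healthy sample from above; in practice:
#   kubectl get endpoints myservice -n myapp | endpoints_ok || echo "selector mismatch?"
if endpoints_ok <<'EOF'
NAME        ENDPOINTS                          AGE
myservice   10.244.1.5:3000,10.244.2.3:3000    1h
EOF
then
  echo "endpoints exist"
fi
```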

🔧 Level 5 — Azure-Level Diagnostics

bash
# Check AKS cluster resource health
az aks show --resource-group $RG --name $CLUSTER \
  --query "{fqdn:fqdn,state:provisioningState,power:powerState.code}" -o table

# List node pool details
az aks nodepool list --resource-group $RG --cluster-name $CLUSTER \
  --query "[].{name:name,count:count,vmSize:vmSize,mode:mode,state:provisioningState}" -o table

# Check for failed Azure operations on the MC_ resource group
MC_RG=$(az aks show --resource-group $RG --name $CLUSTER --query nodeResourceGroup -o tsv)
az monitor activity-log list --resource-group $MC_RG \
  --status Failed --offset 1h -o table

# Check Azure Load Balancer health (for Service type LoadBalancer)
az network lb show --resource-group $MC_RG \
  --name kubernetes --query "frontendIPConfigurations[].{name:name,ip:privateIPAddress}" -o table

# Run AKS diagnostics
az aks check-acr --resource-group $RG --name $CLUSTER --acr myacr.azurecr.io

# Run kubectl inside the cluster via the managed control plane (useful for private clusters)
az aks command invoke --resource-group $RG --name $CLUSTER \
  --command "kubectl logs -n kube-system -l component=kube-proxy --tail=50"

🏭 Real-World Debugging Walkthrough

Scenario: "The App Deployed But Users Get 502 Bad Gateway"

bash
# Step 1: Check if pods are running
kubectl get pods -n production -l app=web
# All 3 pods show Running and 1/1 Ready — so the app is up

# Step 2: Check the Ingress
kubectl describe ingress web-ingress -n production
# Shows backends: 10.244.1.5:3000, 10.244.2.3:3000, 10.244.3.7:3000

# Step 3: Check NGINX Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=20
# [error] upstream prematurely closed connection while reading response header

# Step 4: The app is crashing during request processing — check app logs
kubectl logs -n production -l app=web --tail=20
# Error: connect ECONNREFUSED 10.0.0.5:5432 — can't reach the database!

# Root cause: Database service endpoint changed. Fix the connection string.
kubectl get svc postgres -n production
# The service ClusterIP is 10.0.0.5 — but the database pod has no matching label

kubectl get pods -n production -l app=postgres
# No resources found — the postgres pod was evicted!

kubectl get events -n production --sort-by='.lastTimestamp' | grep postgres
# Evicted due to node disk pressure

# Fix: Restart postgres and add resource requests to prevent eviction
kubectl rollout restart deployment postgres -n production
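The eviction at the heart of this walkthrough can also be hunted mechanically. A sketch that filters Evicted events out of `kubectl get events` output (the sample event lines are hypothetical):

```shell
# evicted_pods: print the OBJECT column for Evicted events from
# `kubectl get events` output (REASON is column 3, OBJECT column 4).
evicted_pods() {
  awk '$3 == "Evicted" { print $4 }'
}

# Hypothetical sample; in practice:
#   kubectl get events -n production --sort-by='.lastTimestamp' | evicted_pods
evicted_pods <<'EOF'
LAST SEEN   TYPE      REASON    OBJECT               MESSAGE
10m         Warning   BackOff   pod/web-2            Back-off restarting failed container
8m          Warning   Evicted   pod/postgres-7d9f    The node was low on resource: ephemeral-storage
EOF
# → pod/postgres-7d9f
```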

📋 Diagnostic Commands Cheat Sheet

| What | Command |
| --- | --- |
| Cluster health | kubectl cluster-info |
| Node status | kubectl get nodes -o wide |
| Node resource usage | kubectl top nodes |
| All resources in namespace | kubectl get all -n <ns> |
| Recent events | kubectl get events --sort-by='.lastTimestamp' -n <ns> |
| Pod details | kubectl describe pod <name> -n <ns> |
| Current logs | kubectl logs <pod> -n <ns> |
| Previous crash logs | kubectl logs <pod> -n <ns> --previous |
| Shell into container | kubectl exec -it <pod> -n <ns> -- /bin/sh |
| DNS test | kubectl run test --image=busybox --rm -it -- nslookup <svc> |
| Service endpoints | kubectl get endpoints <svc> -n <ns> |
| AKS state | az aks show -g <rg> -n <aks> --query provisioningState |
| ACR auth check | az aks check-acr -g <rg> -n <aks> --acr <acr>.azurecr.io |
| Azure activity log | az monitor activity-log list -g <mc_rg> --status Failed |