Hands-on Lesson 13 of 14

Debugging Scenarios — kubectl & az CLI Diagnostics

A systematic debugging toolkit for AKS. Learn the exact commands, order of operations, and mental models that SREs use to diagnose production issues on Azure Kubernetes Service.

🧠 The Debugging Mindset

Production debugging is not guessing. It's a systematic narrowing process:

  1. Cluster level — Are the nodes healthy? Is the API server responding?
  2. Namespace level — Are pods running? Any recent events?
  3. Pod level — What's the pod status? What do the logs say?
  4. Container level — Can I exec into it? What's the process doing?
  5. Network level — Can pods reach each other? Is DNS working?
  6. Azure level — Is the underlying VM healthy? Any Azure outages?
💡 Golden Rule

Always start from the outside in. Check cluster health before looking at individual pods. A node problem will cause cascading pod failures — fixing the node fixes all the pods.
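The outside-in order can be kept at hand as a small checklist helper. A minimal sketch (pure lookup, no cluster access; the placeholder names in angle brackets are mine, not real resources):

```shell
# Hypothetical triage helper: print the outside-in checklist with the
# first command to run at each level. No cluster access needed.
triage_checklist() {
  local levels=(
    "Cluster|kubectl cluster-info"
    "Namespace|kubectl get events -n <ns> --sort-by=.lastTimestamp"
    "Pod|kubectl describe pod <name> -n <ns>"
    "Container|kubectl exec -it <pod> -n <ns> -- /bin/sh"
    "Network|kubectl get endpoints <svc> -n <ns>"
    "Azure|az aks show -g <rg> -n <aks> --query provisioningState"
  )
  local i=1 entry
  for entry in "${levels[@]}"; do
    # entry is "Level|command"; split on the first pipe
    printf '%d. %-10s %s\n' "$i" "${entry%%|*}" "${entry#*|}"
    i=$((i + 1))
  done
}

triage_checklist
```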

🔧 Level 1 — Cluster Health

Is the API Server Responding?

bash
# Quick health check — if this hangs, the control plane is down
kubectl cluster-info
# Kubernetes control plane is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443
# CoreDNS is running at https://myaks-xxxxx.hcp.eastus.azmk8s.io:443/api/v1/...

# Check API server health directly (/healthz still works but is deprecated; /readyz is preferred)
kubectl get --raw='/readyz'
# ok

# Check component status (deprecated since v1.19, and of limited value on AKS's managed control plane)
kubectl get componentstatuses

Are Nodes Healthy?

bash
# List all nodes with status
kubectl get nodes -o wide

# Check node conditions — look for MemoryPressure, DiskPressure, PIDPressure
kubectl describe nodes | grep -A15 "Conditions:"

# Quick check: are any nodes NotReady? (--no-headers drops the header row)
kubectl get nodes --no-headers | grep -v " Ready "

# Check node resource usage
kubectl top nodes
# NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# aks-nodepool1-xxxxx-vmss0000      245m         12%    1834Mi          48%
# aks-nodepool1-xxxxx-vmss0001      189m         9%     1562Mi          41%
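The `kubectl top nodes` output is easy to post-process. A sketch that flags nodes above a memory threshold, run here against the sample output above (in practice, pipe real `kubectl top nodes` output in the same way):

```shell
# flag_nodes: print nodes whose MEMORY% column exceeds a threshold.
# Reads `kubectl top nodes` output on stdin; threshold is $1.
flag_nodes() {
  awk -v limit="$1" 'NR > 1 {        # skip the header row
    mem = $5; sub(/%/, "", mem)      # MEMORY% is column 5; strip the % sign
    if (mem + 0 > limit) print $1, mem "%"
  }'
}

# Sample copied from above; in practice: kubectl top nodes | flag_nodes 45
flag_nodes 45 <<'EOF'
NAME                              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-xxxxx-vmss0000      245m         12%    1834Mi          48%
aks-nodepool1-xxxxx-vmss0001      189m         9%     1562Mi          41%
EOF
# → aks-nodepool1-xxxxx-vmss0000 48%
```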

Azure-Level Cluster Check

bash
# Check AKS cluster provisioning state
az aks show --resource-group $RG --name $CLUSTER \
  --query "{state:provisioningState,powerState:powerState.code,k8sVersion:kubernetesVersion}" -o table
# State      PowerState  K8sVersion
# Succeeded  Running     1.29.4

# A state other than Succeeded (e.g. Updating, Upgrading, Scaling) means an operation is in flight
az aks show --resource-group $RG --name $CLUSTER \
  --query "provisioningState" -o tsv

# Refresh credentials, then capture a diagnostics bundle
az aks get-credentials --resource-group $RG --name $CLUSTER
az aks kollect --resource-group $RG --name $CLUSTER \
  --storage-account $STORAGE_ACCOUNT  # kollect uploads the bundle to a storage account

🔧 Level 2 — Namespace Overview

bash
# Get everything in a namespace at a glance
kubectl get all -n myapp

# Show recent events (MOST USEFUL COMMAND for debugging)
kubectl get events -n myapp --sort-by='.lastTimestamp' | tail -20

# Find pods NOT in Running state
kubectl get pods -n myapp --field-selector 'status.phase!=Running'

# Get resource quota usage (if quotas are set)
kubectl describe resourcequota -n myapp

Events Are Your Best Friend

kubectl get events --sort-by='.lastTimestamp' is the single most useful debugging command. Events show scheduling failures, image pulls, volume mounts, liveness probe failures, and OOM kills — all in chronological order.
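When the event stream is noisy, filtering to Warning-type events narrows it fast. A sketch that keeps the header plus warnings (the sample event lines below are hypothetical):

```shell
# warnings_only: keep the header plus Warning-type events from
# `kubectl get events` output (TYPE is column 2 in the default layout).
warnings_only() {
  awk 'NR == 1 || $2 == "Warning"'
}

# Hypothetical sample; in practice:
#   kubectl get events -n myapp --sort-by='.lastTimestamp' | warnings_only
warnings_only <<'EOF'
LAST SEEN   TYPE      REASON        OBJECT      MESSAGE
5m          Normal    Pulled        pod/web-1   Successfully pulled image
2m          Warning   BackOff       pod/web-2   Back-off restarting failed container
1m          Warning   FailedMount   pod/web-3   Unable to attach or mount volumes
EOF
```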

🔧 Level 3 — Pod Diagnostics

Pod Status Troubleshooting Matrix

| Status | What It Means | First Debug Command |
| --- | --- | --- |
| Pending | Not yet scheduled to a node | kubectl describe pod <name> → check Events |
| ContainerCreating | Image pulling or volume mounting | kubectl describe pod <name> → check Events |
| ImagePullBackOff | Can't pull the container image | kubectl describe pod <name> → check image name & ACR auth |
| CrashLoopBackOff | Container starts then crashes | kubectl logs <name> --previous |
| Running but not Ready | Readiness probe failing | kubectl describe pod <name> → check probe config |
| Evicted | Node ran out of resources | kubectl describe pod <name> → check reason field |
| Terminating (stuck) | Finalizer or graceful shutdown hang | kubectl get pod <name> -o yaml \| grep finaliz |
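The matrix can be wired into a tiny helper so the first command is always one lookup away. A sketch (the angle-bracket names are placeholders, not real flags):

```shell
# first_debug_cmd: map a pod status from the matrix above to the
# first diagnostic command to run (<name> is a placeholder).
first_debug_cmd() {
  case "$1" in
    CrashLoopBackOff)  echo 'kubectl logs <name> --previous' ;;
    Terminating)       echo 'kubectl get pod <name> -o yaml | grep finaliz' ;;
    ImagePullBackOff)  echo 'kubectl describe pod <name>  # image name & ACR auth' ;;
    *)                 echo 'kubectl describe pod <name>  # read the Events section' ;;
  esac
}

first_debug_cmd CrashLoopBackOff
# → kubectl logs <name> --previous
```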

Deep Pod Inspection

bash
# Full describe — shows events, conditions, volumes, env
kubectl describe pod my-pod -n myapp

# View container logs (current run)
kubectl logs my-pod -n myapp

# View previous container logs (after a crash)
kubectl logs my-pod -n myapp --previous

# Follow logs in real-time
kubectl logs my-pod -n myapp -f

# Logs from a specific container in a multi-container pod
kubectl logs my-pod -n myapp -c sidecar

# View logs from ALL pods with a label
kubectl logs -n myapp -l app=myapp --all-containers --tail=50

Exec Into a Running Container

bash
# Get a shell inside the container
kubectl exec -it my-pod -n myapp -- /bin/sh

# Quick checks from inside the container:
# 1. Check environment variables
env | sort

# 2. Check DNS resolution
nslookup redis-service.myapp.svc.cluster.local

# 3. Check connectivity to another service
wget -qO- http://backend-service:8080/health

# 4. Check filesystem
df -h
ls -la /app/

# 5. Check running processes
ps aux

🔧 Level 4 — Network Debugging

DNS Debugging

bash
# Deploy a debug pod with network tools
kubectl run netdebug --image=nicolaka/netshoot -it --rm --restart=Never -- bash

# From inside the debug pod:
# 1. Test DNS resolution
nslookup kubernetes.default.svc.cluster.local
nslookup myservice.myapp.svc.cluster.local

# 2. Test connectivity to a service
curl -v http://myservice.myapp.svc.cluster.local:80/

# 3. Test connectivity to external endpoints
curl -v https://mcr.microsoft.com

# 4. Check DNS config
cat /etc/resolv.conf
# nameserver 10.0.0.10  (CoreDNS ClusterIP)
# search myapp.svc.cluster.local svc.cluster.local cluster.local

Service & Endpoint Debugging

bash
# Check if service has endpoints (i.e., backing pods exist)
kubectl get endpoints myservice -n myapp
# NAME        ENDPOINTS                          AGE
# myservice   10.244.1.5:3000,10.244.2.3:3000    1h

# If ENDPOINTS is empty, the selector doesn't match any pods
# Compare service selector with pod labels:
kubectl get svc myservice -n myapp -o jsonpath='{.spec.selector}'
# {"app":"myapp"}

kubectl get pods -n myapp -l app=myapp
# (should show matching pods)

# Check Ingress
kubectl get ingress -n myapp
kubectl describe ingress myingress -n myapp
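The empty-endpoints check lends itself to scripting, e.g. as a deploy-time guard. A sketch that succeeds only when the ENDPOINTS column holds at least one ip:port pair (an empty selector match shows `<none>` there):

```shell
# endpoints_ok: exit 0 if the ENDPOINTS column of `kubectl get endpoints`
# output contains an ip:port pair, non-zero if it is <none>/empty.
endpoints_ok() {
  awk 'NR == 2 { exit ($2 ~ /:/) ? 0 : 1 }'
}

# Healthy sample from above; in practice:
#   kubectl get endpoints myservice -n myapp | endpoints_ok || echo "selector mismatch?"
if endpoints_ok <<'EOF'
NAME        ENDPOINTS                          AGE
myservice   10.244.1.5:3000,10.244.2.3:3000    1h
EOF
then
  echo "endpoints exist"
fi
```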

🔧 Level 5 — Azure-Level Diagnostics

bash
# Check AKS cluster resource health
az aks show --resource-group $RG --name $CLUSTER \
  --query "{fqdn:fqdn,state:provisioningState,power:powerState.code}" -o table

# List node pool details
az aks nodepool list --resource-group $RG --cluster-name $CLUSTER \
  --query "[].{name:name,count:count,vmSize:vmSize,mode:mode,state:provisioningState}" -o table

# Check for failed Azure operations on the MC_ resource group
MC_RG=$(az aks show --resource-group $RG --name $CLUSTER --query nodeResourceGroup -o tsv)
az monitor activity-log list --resource-group $MC_RG \
  --status Failed --offset 1h -o table

# Check Azure Load Balancer health (for Service type LoadBalancer)
az network lb show --resource-group $MC_RG \
  --name kubernetes --query "frontendIPConfigurations[].{name:name,ip:privateIPAddress}" -o table

# Run AKS diagnostics
az aks check-acr --resource-group $RG --name $CLUSTER --acr myacr.azurecr.io

# Run kubectl inside the cluster via the managed control plane (useful for private clusters)
az aks command invoke --resource-group $RG --name $CLUSTER \
  --command "kubectl logs -n kube-system -l component=kube-proxy --tail=50"

🏭 Real-World Debugging Walkthrough

Scenario: "The App Deployed But Users Get 502 Bad Gateway"

bash
# Step 1: Check if pods are running
kubectl get pods -n production -l app=web
# All 3 pods show Running and 1/1 Ready — so the app is up

# Step 2: Check the Ingress
kubectl describe ingress web-ingress -n production
# Shows backends: 10.244.1.5:3000, 10.244.2.3:3000, 10.244.3.7:3000

# Step 3: Check NGINX Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=20
# [error] upstream prematurely closed connection while reading response header

# Step 4: The app is crashing during request processing — check app logs
kubectl logs -n production -l app=web --tail=20
# Error: connect ECONNREFUSED 10.0.0.5:5432 — can't reach the database!

# Root cause: Database service endpoint changed. Fix the connection string.
kubectl get svc postgres -n production
# The service ClusterIP is 10.0.0.5 — but the database pod has no matching label

kubectl get pods -n production -l app=postgres
# No resources found — the postgres pod was evicted!

kubectl get events -n production --sort-by='.lastTimestamp' | grep postgres
# Evicted due to node disk pressure

# Fix: Restart postgres and add resource requests to prevent eviction
kubectl rollout restart deployment postgres -n production
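The eviction at the heart of this walkthrough can also be hunted mechanically. A sketch that filters Evicted events out of `kubectl get events` output (the sample event lines are hypothetical):

```shell
# evicted_pods: print the OBJECT column for Evicted events from
# `kubectl get events` output (REASON is column 3, OBJECT column 4).
evicted_pods() {
  awk '$3 == "Evicted" { print $4 }'
}

# Hypothetical sample; in practice:
#   kubectl get events -n production --sort-by='.lastTimestamp' | evicted_pods
evicted_pods <<'EOF'
LAST SEEN   TYPE      REASON    OBJECT               MESSAGE
10m         Warning   BackOff   pod/web-2            Back-off restarting failed container
8m          Warning   Evicted   pod/postgres-7d9f    The node was low on resource: ephemeral-storage
EOF
# → pod/postgres-7d9f
```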

📋 Diagnostic Commands Cheat Sheet

| What | Command |
| --- | --- |
| Cluster health | kubectl cluster-info |
| Node status | kubectl get nodes -o wide |
| Node resource usage | kubectl top nodes |
| All resources in namespace | kubectl get all -n <ns> |
| Recent events | kubectl get events --sort-by='.lastTimestamp' -n <ns> |
| Pod details | kubectl describe pod <name> -n <ns> |
| Current logs | kubectl logs <pod> -n <ns> |
| Previous crash logs | kubectl logs <pod> -n <ns> --previous |
| Shell into container | kubectl exec -it <pod> -n <ns> -- /bin/sh |
| DNS test | kubectl run test --image=busybox --rm -it -- nslookup <svc> |
| Service endpoints | kubectl get endpoints <svc> -n <ns> |
| AKS state | az aks show -g <rg> -n <aks> --query provisioningState |
| ACR auth check | az aks check-acr -g <rg> -n <aks> --acr <acr>.azurecr.io |
| Azure activity log | az monitor activity-log list -g <mc_rg> --status Failed |