Scaling & Autoscaling
Learn every scaling lever AKS offers — from manual node scaling to event-driven autoscaling with KEDA — and how to combine them for production workloads.
🧒 Simple Explanation (ELI5)
Imagine a restaurant. When more customers arrive, you need more tables and waiters (that's scaling). The Cluster Autoscaler is a manager who watches the waiting line — if there are people with no table, they call in more tables (nodes). The HPA is a supervisor who watches how busy each waiter is — if waiters are overwhelmed, they hire more waiters (pods). KEDA is a special manager who also checks external signals — like a food delivery queue — and adds staff to handle those orders too. When the restaurant is quiet, all these managers send the extra staff home to save money.
🔧 Technical Explanation
Scaling Dimensions
| Scaling Type | What Scales | Trigger | Tool |
|---|---|---|---|
| Manual Pod Scaling | Pod replicas | Human decision | kubectl scale |
| Manual Node Scaling | Nodes in a pool | Human decision | az aks scale / az aks nodepool scale |
| Horizontal Pod Autoscaler (HPA) | Pod replicas | CPU, memory, custom metrics | Built-in (metrics-server) |
| Vertical Pod Autoscaler (VPA) | Pod resource requests | Historical usage | VPA addon |
| Cluster Autoscaler (CA) | Nodes | Pending (unschedulable) pods | AKS-managed |
| KEDA | Pod replicas (or jobs) | External events (queues, cron, HTTP) | KEDA addon |
Cluster Autoscaler
The Cluster Autoscaler (CA) runs as a managed component in AKS. It watches for pods that can't be scheduled due to insufficient resources and adds nodes. It also removes underutilized nodes after a cool-down period.
Scale-up path: a Pending pod (no node fits) triggers a scale-up, the new node joins the pool, and the pod is scheduled. Scale-down path: a node utilized below 50% for 10 minutes, whose pods are all safe-to-evict, is drained and removed.
Key Autoscaler Profile Settings
| Setting | Default | Description |
|---|---|---|
| scan-interval | 10s | How often CA checks for pending pods |
| scale-down-delay-after-add | 10m | Wait time after adding a node before considering scale-down |
| scale-down-unneeded-time | 10m | How long a node must be underutilized before removal |
| scale-down-utilization-threshold | 0.5 | Node is underutilized if below this CPU+memory ratio |
| max-graceful-termination-sec | 600 | Max time to wait for pod eviction during scale-down |
| skip-nodes-with-local-storage | true | Don't evict pods using emptyDir or hostPath |
| expander | random | Strategy for choosing which node pool to scale (random, most-pods, least-waste, priority) |
Use priority expander when mixing Spot and regular node pools — it lets you prefer Spot pools first and fall back to on-demand pools if Spot capacity is unavailable.
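The priority expander reads its pool preferences from a ConfigMap named `cluster-autoscaler-priority-expander` in `kube-system`. A minimal sketch — the pool names (`spotpool`, `nodepool1`) are assumptions for illustration:

```yaml
# Higher number = higher priority. Keys are priority values; entries are
# regexes matched against node pool names.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    20:
      - .*spotpool.*    # prefer the Spot pool first
    10:
      - .*nodepool1.*   # fall back to the on-demand pool
```

With this in place, the CA tries the Spot pool on every scale-up and only falls back to the on-demand pool when Spot capacity is unavailable.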
Horizontal Pod Autoscaler (HPA)
HPA adjusts pod replica counts based on observed metrics. It requires metrics-server (installed by default in AKS) for CPU/memory metrics, or a custom metrics adapter for application-level metrics.
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

HPA calculates utilization as current usage ÷ resource request. If your pods don't have `resources.requests` defined, HPA cannot calculate CPU/memory utilization and will not scale. Always set requests on pods that use HPA.
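The replica calculation can be made concrete. A small sketch of the HPA formula — the 10% tolerance mirrors the controller's default dead band, which is configurable on self-managed control planes:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Approximation of the HPA formula:
    desired = ceil(current * currentMetric / targetMetric),
    skipping the change when the ratio is within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)

# 5 pods averaging 140% CPU against a 70% target -> scale to 10
print(desired_replicas(5, 140, 70))  # 10
# 5 pods at 72% against a 70% target -> within tolerance, stay at 5
print(desired_replicas(5, 72, 70))   # 5
```

The `ceil` is why HPA rounds up aggressively on the way up but steps down gradually on the way down, especially when combined with the `scaleDown` stabilization window shown above.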
Vertical Pod Autoscaler (VPA)
VPA analyzes historical resource usage and recommends (or automatically applies) optimal CPU and memory requests. It's useful when you don't know the right resource sizing for workloads.
| Mode | Behavior | Use Case |
|---|---|---|
| Off | Only provides recommendations, no changes | Audit existing workloads |
| Initial | Sets requests at pod creation only | Right-size new pods without disruption |
| Auto | Evicts and recreates pods with new requests | Fully automated (causes restarts) |
Do not use HPA and VPA on the same metric (e.g., both scaling on CPU). HPA changes replicas while VPA changes resource requests — they can fight. Use HPA for CPU scaling and VPA only for memory recommendations, or use VPA in "Off" mode alongside HPA.
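A VPA in recommendation-only mode is safe to run alongside an HPA. A manifest sketch — the target Deployment name is an assumption:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or mutate pods
```

Read the recommendations with `kubectl describe vpa api-server-vpa` and apply them manually to the Deployment's `resources.requests` during a planned rollout.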
KEDA (Kubernetes Event-Driven Autoscaling)
KEDA extends HPA by adding external event sources as scaling triggers. AKS offers KEDA as a managed addon.
| Scaler | Trigger Source | Example |
|---|---|---|
| Azure Service Bus | Queue message count | Scale workers when queue depth > 10 |
| Azure Storage Queue | Queue length | Process blob uploads |
| Cron | Time schedule | Scale up at 8 AM, down at 6 PM |
| Prometheus | Custom metrics | Scale on HTTP request rate |
| Azure Monitor | Log Analytics queries | Scale on custom app telemetry |
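Under the hood, queue-based scalers feed an external metric into an HPA that KEDA manages; the effective replica count is roughly the queue length divided by the per-pod threshold, clamped to the configured bounds. A simplified model:

```python
import math

def keda_queue_replicas(queue_length: int, threshold: int,
                        min_replicas: int, max_replicas: int) -> int:
    """Simplified model of a KEDA queue scaler: one pod per
    `threshold` messages, clamped to the configured bounds."""
    desired = math.ceil(queue_length / threshold)
    return max(min_replicas, min(desired, max_replicas))

# 42 messages, 5 per pod -> ceil(8.4) = 9 pods
print(keda_queue_replicas(42, threshold=5, min_replicas=1, max_replicas=100))  # 9
# Empty queue with minReplicaCount 0 -> scale to zero
print(keda_queue_replicas(0, threshold=5, min_replicas=0, max_replicas=100))  # 0
```

This is a simplification (KEDA also supports activation thresholds and takes the max across multiple triggers), but it captures the sizing behavior of the `messageCount` setting in the manifest below.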
```yaml
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 120
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "5"
        connectionFromEnv: SERVICEBUS_CONNECTION
```

Spot Node Pools & Scaling
Spot node pools use Azure's excess capacity at up to 90% discount. They can be evicted at any time with 30-second notice.
During scale-up, Spot VMs may not be available (capacity-constrained). Cluster Autoscaler will retry but may fall back to a regular pool if the priority expander is configured. Never run stateful or critical workloads solely on Spot pools.
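AKS taints Spot pool nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so only workloads that explicitly tolerate it land there. A pod spec fragment for a batch worker — container name and image are placeholders:

```yaml
spec:
  tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  affinity:
    nodeAffinity:
      # Prefer Spot nodes, but allow scheduling on on-demand nodes
      # when Spot capacity is unavailable.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values: ["spot"]
  containers:
    - name: batch-worker
      image: myregistry.azurecr.io/batch-worker:latest
```

Using a *preferred* (not required) node affinity is what enables the fallback behavior described above.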
Best Practices
| Practice | Why |
|---|---|
| Set PodDisruptionBudgets (PDB) | Prevents CA from evicting all replicas at once during scale-down |
| Use Priority Classes | Ensures critical pods are scheduled first; CA adds nodes for lower-priority pods |
| Over-provision with "pause" pods | Low-priority pods hold capacity; evicted instantly to make room for real pods — eliminates VM boot wait time |
| Set --max-surge on node pools | Controls how many extra nodes are added during upgrades — prevents cluster-level resource exhaustion |
| Use multiple node pools | Separate scaling profiles for different workload types (GPU, Spot, burstable) |
⌨️ Hands-on
Enable Cluster Autoscaler
```bash
# Enable autoscaler on the default node pool
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10

# Verify autoscaler status
az aks show -g myResourceGroup -n myAKSCluster \
  --query "agentPoolProfiles[0].{name:name, minCount:minCount, maxCount:maxCount, enableAutoScaling:enableAutoScaling}" \
  -o table
# Name       MinCount    MaxCount    EnableAutoScaling
# ---------  ----------  ----------  -------------------
# nodepool1  2           10          True
```

Configure Autoscaler Profile
```bash
# Tune cluster-wide autoscaler settings
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile \
    scan-interval=20s \
    scale-down-delay-after-add=5m \
    scale-down-unneeded-time=5m \
    scale-down-utilization-threshold=0.6 \
    expander=priority

# Verify the profile
az aks show -g myResourceGroup -n myAKSCluster \
  --query "autoScalerProfile" -o json
```

Create HPA & Simulate Load
```bash
# Deploy a sample app and set resource requests
# (kubectl create deployment has no --requests flag; use kubectl set resources)
kubectl create deployment php-apache \
  --image=registry.k8s.io/hpa-example
kubectl set resources deployment php-apache --requests=cpu=200m
kubectl expose deployment php-apache --port=80

# Create an HPA targeting 50% CPU
kubectl autoscale deployment php-apache \
  --cpu-percent=50 \
  --min=1 \
  --max=20

# Check HPA status
kubectl get hpa php-apache
# NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# php-apache   Deployment/php-apache   0%/50%    1         20        1          30s
```
```bash
# Generate load from a separate terminal
kubectl run load-gen --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://php-apache; done"

# Watch scaling in real-time (in your main terminal)
kubectl get hpa php-apache -w
# NAME         REFERENCE               TARGETS    MINPODS   MAXPODS   REPLICAS
# php-apache   Deployment/php-apache   248%/50%   1         20        1
# php-apache   Deployment/php-apache   248%/50%   1         20        5
# php-apache   Deployment/php-apache   76%/50%    1         20        8
# php-apache   Deployment/php-apache   49%/50%    1         20        8

# Stop the load generator
kubectl delete pod load-gen

# After ~5 minutes, HPA scales back down
kubectl get hpa php-apache
# TARGETS   REPLICAS
# 0%/50%    1
```
Enable KEDA Addon
```bash
# Enable KEDA on your AKS cluster
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-keda

# Verify KEDA operator is running
kubectl get pods -n kube-system -l app=keda-operator
# NAME                             READY   STATUS    RESTARTS
# keda-operator-7c8d6b5f4d-x9k2j   1/1     Running   0

# Check KEDA CRDs are installed
kubectl get crd | grep keda
# scaledobjects.keda.sh
# scaledjobs.keda.sh
# triggerauthentications.keda.sh
```
Create a PodDisruptionBudget
```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
```

```bash
kubectl apply -f pdb.yaml
kubectl get pdb -n production
# NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# api-pdb   2               N/A               1                     5s
```
🐛 Debugging Scenarios
Scenario 1: Cluster Autoscaler Not Adding Nodes
Symptom: Pods are stuck in Pending state but no new nodes appear even though autoscaler is enabled.
```bash
# Step 1: Confirm pods are Pending due to insufficient resources
kubectl get pods --field-selector=status.phase=Pending
kubectl describe pod <pending-pod>
# Look for: "0/5 nodes are available: 5 Insufficient cpu"

# Step 2: Verify autoscaler is enabled and not at max
az aks nodepool show \
  -g myResourceGroup --cluster-name myAKSCluster -n nodepool1 \
  --query "{autoScaling:enableAutoScaling, min:minCount, max:maxCount, current:count}"
# If current == max, increase --max-count

# Step 3: Check Azure VM quota in the region
az vm list-usage --location eastus -o table | grep "Standard DS"
# If "CurrentValue" equals "Limit" — request a quota increase

# Step 4: Check for cluster-autoscaler events
kubectl get events --field-selector reason=ScaleUp -A --sort-by='.lastTimestamp'
kubectl get events --field-selector reason=NotTriggerScaleUp -A

# Step 5: Check autoscaler status configmap
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# Look for "ScaleUp" status, "backoff" entries, or "noScaleUp" reasons

# Step 6: If using Spot VMs, check if capacity is available
# Switch expander to allow fallback to regular pools
az aks update -g myResourceGroup -n myAKSCluster \
  --cluster-autoscaler-profile expander=priority
```

Scenario 2: HPA Not Scaling Up
Symptom: kubectl get hpa shows <unknown>/50% in the TARGETS column — HPA can't read metrics.
```bash
# Step 1: Check if metrics-server is running
kubectl get deployment metrics-server -n kube-system
kubectl top nodes   # Should show CPU/memory for each node
kubectl top pods    # Should show CPU/memory for each pod

# Step 2: If metrics-server is not running or unhealthy
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server --tail=50

# Step 3: Verify the deployment has resource requests set
kubectl get deployment api-server -o jsonpath='{.spec.template.spec.containers[0].resources}'
# If empty — HPA cannot calculate %. Add resources.requests:
kubectl set resources deployment api-server --requests=cpu=200m,memory=256Mi

# Step 4: Check HPA events for specific errors
kubectl describe hpa api-hpa
# Look for: "failed to get cpu utilization" or "missing request for cpu"

# Step 5: If using custom metrics, verify the adapter
kubectl get apiservice v1beta1.custom.metrics.k8s.io -o yaml
# If status shows "False" — the custom metrics adapter is broken

# Step 6: After fixing, watch HPA recover
kubectl get hpa api-hpa -w
```

Scenario 3: Scale-Down Too Aggressive — Pods Getting Evicted
Symptom: The autoscaler keeps removing nodes during business hours, causing brief disruptions as pods are rescheduled.
```bash
# Step 1: Check current autoscaler profile
az aks show -g myResourceGroup -n myAKSCluster \
  --query "autoScalerProfile" -o table

# Step 2: Increase scale-down cooldowns and LOWER the utilization threshold
# (a node is a scale-down candidate when utilization is BELOW the threshold,
#  so a lower value makes scale-down less aggressive, not more)
az aks update -g myResourceGroup -n myAKSCluster \
  --cluster-autoscaler-profile \
    scale-down-delay-after-add=15m \
    scale-down-unneeded-time=15m \
    scale-down-utilization-threshold=0.3

# Step 3: Ensure PodDisruptionBudgets are set for critical services
kubectl get pdb -A
# If missing, create PDBs with minAvailable for each stateful/critical service

# Step 4: Annotate pods that should NOT be evicted
kubectl annotate pod important-pod-xyz \
  "cluster-autoscaler.kubernetes.io/safe-to-evict"="false"

# Step 5: Use over-provisioning — deploy low-priority "pause" pods
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
EOF
# These pods hold capacity. When real pods need space, pause pods are
# evicted instantly — no waiting for new nodes to boot.
```

Over-provisioning with low-priority pause pods is the single most effective technique for reducing scale-up latency. Real pods preempt pause pods instantly, while the CA adds replacement capacity in the background.
🎯 Interview Questions
Beginner
**Q: What is the difference between manual scaling and autoscaling in AKS?**
Manual scaling requires a human to run az aks scale (nodes) or kubectl scale (pods) to change capacity. Autoscaling automatically adjusts capacity based on signals: HPA scales pods based on CPU/memory, Cluster Autoscaler adds/removes nodes based on pending pods, and KEDA scales based on external events. Autoscaling is essential for workloads with variable traffic.

**Q: How does the Cluster Autoscaler decide to add a node?**
The Cluster Autoscaler (CA) watches for pods in Pending state that can't be scheduled because no node has enough resources. When it finds such pods, it calculates which node pool can fit them and requests a new VM from Azure. The new node joins the cluster and the pending pod gets scheduled. CA checks this every scan-interval (default 10 seconds).

**Q: What metrics does HPA use, and how does it calculate the desired replica count?**
By default, HPA uses CPU and memory utilization from the metrics-server. You specify a target percentage (e.g., 70% CPU). HPA calculates: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). It also supports custom metrics (via Prometheus adapter) and external metrics (via KEDA or Azure Monitor adapter).

**Q: What is a PodDisruptionBudget, and why does it matter for autoscaling?**
A PodDisruptionBudget (PDB) tells Kubernetes the minimum number of replicas that must remain available during voluntary disruptions (like node drain during scale-down or upgrades). Without PDBs, the Cluster Autoscaler could evict all replicas of a service simultaneously, causing downtime. PDBs ensure graceful scale-down.

**Q: What is KEDA, and how does it differ from standard HPA?**
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA by adding support for external event sources as scaling triggers — such as Azure Service Bus queue depth, Storage Queue length, or cron schedules. Standard HPA only scales on pod CPU/memory or custom metrics from inside the cluster. KEDA can also scale to zero, which standard HPA cannot (minimum is 1 replica).
Intermediate
**Q: How do HPA and the Cluster Autoscaler work together?**
They operate at different layers and complement each other. HPA increases pod replicas when CPU/memory is high. If the cluster doesn't have enough node capacity for the new pods, they go Pending. The Cluster Autoscaler detects the Pending pods and adds nodes. When load drops, HPA reduces replicas, nodes become underutilized, and CA removes them. The chain is: metrics → HPA → more pods → Pending → CA → more nodes.

**Q: What are the Cluster Autoscaler expander strategies, and when would you use each?**
random: picks a random eligible node pool (default). most-pods: picks the pool that can schedule the most pending pods. least-waste: picks the pool with the least idle CPU/memory after scheduling. priority: uses a ConfigMap to define pool priority order — ideal for preferring Spot pools with fallback to on-demand. The priority expander is most commonly used in production.

**Q: Why shouldn't HPA and VPA scale on the same metric?**
HPA and VPA both react to the same signals but take opposite actions. If both target CPU: HPA adds pods to reduce per-pod CPU utilization, while VPA increases resource requests per pod. VPA's increase makes HPA think utilization dropped (larger denominator), so HPA scales down. This creates oscillation. Best practice: use HPA for CPU-based scaling, VPA in "Off" mode for recommendation only, or VPA only for memory.

**Q: What does --max-surge do on a node pool?**
--max-surge controls how many extra nodes are created during a rolling upgrade. For example, --max-surge 1 adds 1 extra node, cordons an old node, drains its pods to the new one, then deletes the old node. Higher values speed up upgrades but consume more Azure quota temporarily. Use --max-surge 33% for a balance of speed and resource usage. For Spot pools, surge nodes may not be available, so the upgrade may stall.

**Q: How does over-provisioning with pause pods reduce scale-up latency?**
Over-provisioning deploys low-priority "pause" pods (using registry.k8s.io/pause) that reserve node capacity. These pods have a PriorityClass with a very low value (e.g., -1). When real workload pods need scheduling, Kubernetes preempts the pause pods instantly — no VM boot delay. The CA then backfills by adding new nodes for the evicted pause pods. This turns a 3-5 minute scale-up wait into near-instant scheduling.
Scenario-Based
**Q: You expect a 10x traffic spike for a flash sale. How do you prepare the cluster?**
1) Deploy over-provisioning pause pods that reserve 10-15 nodes of headroom — real pods preempt them instantly. 2) Use KEDA with a cron trigger to pre-scale the deployment to 50% capacity 15 minutes before the sale. 3) Set aggressive HPA scaleUp behavior: stabilizationWindowSeconds: 0 with 100% scale-up policy. 4) Combine Spot and on-demand node pools with priority expander. 5) Pre-warm the cluster by bumping --min-count on the node pool before the event.

**Q: The Cluster Autoscaler is not removing an underutilized node. What could be the reasons?**
Possible causes: 1) scale-down-unneeded-time hasn't elapsed yet (default 10m, but may be set higher). 2) Pods on those nodes have safe-to-evict: false annotation. 3) Pods use local storage (emptyDir) and skip-nodes-with-local-storage is true. 4) A PodDisruptionBudget prevents eviction (PDB allows 0 disruptions). 5) Pods belong to DaemonSets (these don't block scale-down but increase utilization). Check with: kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml — it shows why each node is not eligible for removal.

**Q: How would you run batch workloads cost-effectively on Spot node pools?**
1) Use the priority expander with Spot as primary and a regular pool as fallback. 2) Set tolerations on batch pods for the Spot taint (kubernetes.azure.com/scalesetpriority: spot). 3) Use multiple Spot pools with different VM sizes to increase capacity availability. 4) Design batch jobs to be idempotent and checkpoint-able so eviction doesn't lose progress. 5) Set terminationGracePeriodSeconds: 25 (Spot gets 30s notice) to save state. 6) Use KEDA ScaledJobs for batch processing — KEDA creates new Job pods to replace evicted ones.

**Q: HPA shows high CPU but replicas are not increasing. How do you troubleshoot?**
1) Check if current replicas already equal maxReplicas: kubectl get hpa. 2) Check the scaleUp.stabilizationWindowSeconds — HPA waits this long before scaling to avoid flapping. 3) Check if the deployment has a spec.replicas that's been set manually — this can conflict with HPA. 4) Verify HPA behavior policies — a restrictive scaleUp.policies might limit how fast scaling happens. 5) Check HPA events: kubectl describe hpa for "FailedGetScale" or "ScalingLimited" events.

**Q: Design autoscaling for a queue-driven worker that is idle at night and busy during business hours.**
Use KEDA with two triggers: 1) azure-servicebus scaler targeting the queue, with messageCount: 10 (1 pod per 10 messages), minReplicaCount: 0 (scale to zero at night), maxReplicaCount: 50. 2) A cron trigger to pre-warm to 5 replicas at 7:45 AM before business hours start. Set cooldownPeriod: 300 to avoid premature scale-down between message bursts. Use workload identity for KEDA to authenticate to Service Bus. Pair with Cluster Autoscaler to handle the node-level scaling for up to 50 pods.
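That design can be sketched as a single ScaledObject with both triggers (KEDA takes the maximum across triggers). The deployment name, queue name, schedule, and the referenced TriggerAuthentication are assumptions for illustration:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invoice-worker
spec:
  scaleTargetRef:
    name: invoice-worker
  minReplicaCount: 0          # scale to zero overnight
  maxReplicaCount: 50
  cooldownPeriod: 300         # ride out gaps between message bursts
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: invoices
        messageCount: "10"    # 1 pod per 10 messages
      authenticationRef:
        name: servicebus-workload-identity   # hypothetical TriggerAuthentication
    - type: cron
      metadata:
        timezone: America/New_York
        start: 45 7 * * 1-5   # pre-warm at 7:45 AM on weekdays
        end: 0 18 * * 1-5
        desiredReplicas: "5"
```

During the cron window the worker never drops below 5 replicas; outside it, the queue scaler alone drives the count, down to zero when the queue is empty.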
🌍 Real-World Use Case
E-Commerce Flash Sale — KEDA + Cluster Autoscaler
An online retailer handles 500 RPS normally but sees 10x traffic spikes during flash sales. Their scaling architecture:
- Baseline: 5 nodes (Standard_D4s_v5, 4 vCPU / 16 GB each), web pods min 6 replicas via HPA.
- Pre-sale: KEDA cron trigger scales web pods to 30 replicas at T-15 minutes. Cluster Autoscaler adds nodes to fit them.
- During sale: HPA targets 60% CPU with aggressive scale-up (100% increase per 30s). KEDA scales order-processor pods based on Service Bus queue depth (1 pod per 5 messages, max 100).
- Node pools: Primary on-demand pool (min 5, max 20) + Spot pool (min 0, max 30) with priority expander preferring Spot first.
- Over-provisioning: 3 pause pods reserving 3 nodes of headroom — instant scheduling for the first wave of scale-up.
- PDBs: web pods `minAvailable: 4`, checkout pods `minAvailable: 2` — ensures availability during autoscaler scale-down.
- Post-sale: HPA/KEDA scale down over 10 minutes (stabilization window). CA removes idle nodes after 15 minutes. Spot pool drains to 0.
- Result: Zero downtime during 10x traffic spikes, 60% cost reduction with Spot nodes, 95th percentile latency under 200ms.
📝 Summary
- Manual scaling (`az aks scale`, `kubectl scale`) is good for planned changes; autoscaling handles the unpredictable.
- Cluster Autoscaler adds/removes nodes based on pending pods — configure min/max counts and profile settings carefully.
- HPA scales pod replicas on CPU/memory — requires resource requests to be set on pods.
- VPA right-sizes pod resource requests — use in "Off" mode alongside HPA to avoid conflicts.
- KEDA scales on external events (queues, cron, custom metrics) and can scale to zero — ideal for event-driven workloads.
- Over-provisioning with low-priority pause pods eliminates VM boot latency — schedule real pods instantly.
- PodDisruptionBudgets are mandatory for production — they prevent the autoscaler from causing downtime during scale-down.
- Combine HPA + CA + KEDA + Spot pools + PDBs for a cost-efficient, resilient autoscaling strategy.