Scaling & Autoscaling
Learn every scaling lever AKS offers — from manual node scaling to event-driven autoscaling with KEDA — and how to combine them for production workloads.
🧒 Simple Explanation (ELI5)
Imagine a restaurant. When more customers arrive, you need more tables and waiters (that's scaling). The Cluster Autoscaler is a manager who watches the waiting line — if there are people with no table, they call in more tables (nodes). The HPA is a supervisor who watches how busy each waiter is — if waiters are overwhelmed, they hire more waiters (pods). KEDA is a special manager who also checks external signals — like a food delivery queue — and adds staff to handle those orders too. When the restaurant is quiet, all these managers send the extra staff home to save money.
🔧 Technical Explanation
Scaling Dimensions
| Scaling Type | What Scales | Trigger | Tool |
|---|---|---|---|
| Manual Pod Scaling | Pod replicas | Human decision | kubectl scale |
| Manual Node Scaling | Nodes in a pool | Human decision | az aks scale / az aks nodepool scale |
| Horizontal Pod Autoscaler (HPA) | Pod replicas | CPU, memory, custom metrics | Built-in (metrics-server) |
| Vertical Pod Autoscaler (VPA) | Pod resource requests | Historical usage | VPA addon |
| Cluster Autoscaler (CA) | Nodes | Pending (unschedulable) pods | AKS-managed |
| KEDA | Pod replicas (or jobs) | External events (queues, cron, HTTP) | KEDA addon |
Cluster Autoscaler
The Cluster Autoscaler (CA) runs as a managed component in AKS. It watches for pods that can't be scheduled due to insufficient resources and adds nodes. It also removes underutilized nodes after a cool-down period.
Scale-up path: a Pending pod (no node fits) triggers a scale-up, the new node joins the pool, and the pod is scheduled. Scale-down path: a node utilized below 50% for 10 minutes, whose pods are all safe-to-evict, is drained and removed.
Key Autoscaler Profile Settings
| Setting | Default | Description |
|---|---|---|
| scan-interval | 10s | How often CA checks for pending pods |
| scale-down-delay-after-add | 10m | Wait time after adding a node before considering scale-down |
| scale-down-unneeded-time | 10m | How long a node must be underutilized before removal |
| scale-down-utilization-threshold | 0.5 | Node is underutilized if below this CPU+memory ratio |
| max-graceful-termination-sec | 600 | Max time to wait for pod eviction during scale-down |
| skip-nodes-with-local-storage | true | Don't evict pods using emptyDir or hostPath |
| expander | random | Strategy for choosing which node pool to scale (random, most-pods, least-waste, priority) |
Use priority expander when mixing Spot and regular node pools — it lets you prefer Spot pools first and fall back to on-demand pools if Spot capacity is unavailable.
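The priority expander reads its pool preferences from a ConfigMap named `cluster-autoscaler-priority-expander` in `kube-system`. A minimal sketch — the pool names (`spotpool`, `nodepool1`) are assumptions for illustration:

```yaml
# Higher number = higher priority. Keys are priority values; entries are
# regexes matched against node pool names.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    20:
      - .*spotpool.*    # prefer the Spot pool first
    10:
      - .*nodepool1.*   # fall back to the on-demand pool
```

With this in place, the CA tries the Spot pool on every scale-up and only falls back to the on-demand pool when Spot capacity is unavailable.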
Horizontal Pod Autoscaler (HPA)
HPA adjusts pod replica counts based on observed metrics. It requires metrics-server (installed by default in AKS) for CPU/memory metrics, or a custom metrics adapter for application-level metrics.
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

HPA calculates utilization as current usage ÷ resource request. If your pods don't have `resources.requests` defined, HPA cannot calculate CPU/memory utilization and will not scale. Always set requests on pods that use HPA.
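The replica calculation can be made concrete. A small sketch of the HPA formula — the 10% tolerance mirrors the controller's default dead band, which is configurable on self-managed control planes:

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Approximation of the HPA formula:
    desired = ceil(current * currentMetric / targetMetric),
    skipping the change when the ratio is within the tolerance band."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)

# 5 pods averaging 140% CPU against a 70% target -> scale to 10
print(desired_replicas(5, 140, 70))  # 10
# 5 pods at 72% against a 70% target -> within tolerance, stay at 5
print(desired_replicas(5, 72, 70))   # 5
```

The `ceil` is why HPA rounds up aggressively on the way up but steps down gradually on the way down, especially when combined with the `scaleDown` stabilization window shown above.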
Vertical Pod Autoscaler (VPA)
VPA analyzes historical resource usage and recommends (or automatically applies) optimal CPU and memory requests. It's useful when you don't know the right resource sizing for workloads.
| Mode | Behavior | Use Case |
|---|---|---|
| Off | Only provides recommendations, no changes | Audit existing workloads |
| Initial | Sets requests at pod creation only | Right-size new pods without disruption |
| Auto | Evicts and recreates pods with new requests | Fully automated (causes restarts) |
Do not use HPA and VPA on the same metric (e.g., both scaling on CPU). HPA changes replicas while VPA changes resource requests — they can fight. Use HPA for CPU scaling and VPA only for memory recommendations, or use VPA in "Off" mode alongside HPA.
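A VPA in recommendation-only mode is safe to run alongside an HPA. A manifest sketch — the target Deployment name is an assumption:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or mutate pods
```

Read the recommendations with `kubectl describe vpa api-server-vpa` and apply them manually to the Deployment's `resources.requests` during a planned rollout.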
KEDA (Kubernetes Event-Driven Autoscaling)
KEDA extends HPA by adding external event sources as scaling triggers. AKS offers KEDA as a managed addon.
| Scaler | Trigger Source | Example |
|---|---|---|
| Azure Service Bus | Queue message count | Scale workers when queue depth > 10 |
| Azure Storage Queue | Queue length | Process blob uploads |
| Cron | Time schedule | Scale up at 8 AM, down at 6 PM |
| Prometheus | Custom metrics | Scale on HTTP request rate |
| Azure Monitor | Log Analytics queries | Scale on custom app telemetry |
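Under the hood, queue-based scalers feed an external metric into an HPA that KEDA manages; the effective replica count is roughly the queue length divided by the per-pod threshold, clamped to the configured bounds. A simplified model:

```python
import math

def keda_queue_replicas(queue_length: int, threshold: int,
                        min_replicas: int, max_replicas: int) -> int:
    """Simplified model of a KEDA queue scaler: one pod per
    `threshold` messages, clamped to the configured bounds."""
    desired = math.ceil(queue_length / threshold)
    return max(min_replicas, min(desired, max_replicas))

# 42 messages, 5 per pod -> ceil(8.4) = 9 pods
print(keda_queue_replicas(42, threshold=5, min_replicas=1, max_replicas=100))  # 9
# Empty queue with minReplicaCount 0 -> scale to zero
print(keda_queue_replicas(0, threshold=5, min_replicas=0, max_replicas=100))  # 0
```

This is a simplification (KEDA also supports activation thresholds and takes the max across multiple triggers), but it captures the sizing behavior of the `messageCount` setting in the manifest below.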
```yaml
# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 120
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "5"
        connectionFromEnv: SERVICEBUS_CONNECTION
```

Spot Node Pools & Scaling
Spot node pools use Azure's excess capacity at up to 90% discount. They can be evicted at any time with 30-second notice.
During scale-up, Spot VMs may not be available (capacity-constrained). Cluster Autoscaler will retry but may fall back to a regular pool if the priority expander is configured. Never run stateful or critical workloads solely on Spot pools.
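AKS taints Spot pool nodes with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so only workloads that explicitly tolerate it land there. A pod spec fragment for a batch worker — container name and image are placeholders:

```yaml
spec:
  tolerations:
    - key: kubernetes.azure.com/scalesetpriority
      operator: Equal
      value: spot
      effect: NoSchedule
  affinity:
    nodeAffinity:
      # Prefer Spot nodes, but allow scheduling on on-demand nodes
      # when Spot capacity is unavailable.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values: ["spot"]
  containers:
    - name: batch-worker
      image: myregistry.azurecr.io/batch-worker:latest
```

Using a *preferred* (not required) node affinity is what enables the fallback behavior described above.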
Best Practices
| Practice | Why |
|---|---|
| Set PodDisruptionBudgets (PDB) | Prevents CA from evicting all replicas at once during scale-down |
| Use Priority Classes | Ensures critical pods are scheduled first; CA adds nodes for lower-priority pods |
| Over-provision with "pause" pods | Low-priority pods hold capacity; evicted instantly to make room for real pods — eliminates VM boot wait time |
| Set --max-surge on node pools | Controls how many extra nodes are added during upgrades — prevents cluster-level resource exhaustion |
| Use multiple node pools | Separate scaling profiles for different workload types (GPU, Spot, burstable) |
⌨️ Hands-on
Enable Cluster Autoscaler
```bash
# Enable autoscaler on the default node pool
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10

# Verify autoscaler status
az aks show -g myResourceGroup -n myAKSCluster \
  --query "agentPoolProfiles[0].{name:name, minCount:minCount, maxCount:maxCount, enableAutoScaling:enableAutoScaling}" \
  -o table
# Name       MinCount    MaxCount    EnableAutoScaling
# ---------  ----------  ----------  -------------------
# nodepool1  2           10          True
```

Configure Autoscaler Profile
```bash
# Tune cluster-wide autoscaler settings
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --cluster-autoscaler-profile \
    scan-interval=20s \
    scale-down-delay-after-add=5m \
    scale-down-unneeded-time=5m \
    scale-down-utilization-threshold=0.6 \
    expander=priority

# Verify the profile
az aks show -g myResourceGroup -n myAKSCluster \
  --query "autoScalerProfile" -o json
```

Create HPA & Simulate Load
```bash
# Deploy a sample app and set resource requests
# (kubectl create deployment has no --requests flag; use kubectl set resources)
kubectl create deployment php-apache \
  --image=registry.k8s.io/hpa-example
kubectl set resources deployment php-apache --requests=cpu=200m
kubectl expose deployment php-apache --port=80

# Create an HPA targeting 50% CPU
kubectl autoscale deployment php-apache \
  --cpu-percent=50 \
  --min=1 \
  --max=20

# Check HPA status
kubectl get hpa php-apache
# NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# php-apache   Deployment/php-apache   0%/50%    1         20        1          30s
```
```bash
# Generate load from a separate terminal
kubectl run load-gen --image=busybox:1.36 --restart=Never -- \
  /bin/sh -c "while true; do wget -q -O- http://php-apache; done"

# Watch scaling in real-time (in your main terminal)
kubectl get hpa php-apache -w
# NAME         REFERENCE               TARGETS    MINPODS   MAXPODS   REPLICAS
# php-apache   Deployment/php-apache   248%/50%   1         20        1
# php-apache   Deployment/php-apache   248%/50%   1         20        5
# php-apache   Deployment/php-apache   76%/50%    1         20        8
# php-apache   Deployment/php-apache   49%/50%    1         20        8

# Stop the load generator
kubectl delete pod load-gen

# After ~5 minutes, HPA scales back down
kubectl get hpa php-apache
# TARGETS   REPLICAS
# 0%/50%    1
```
Enable KEDA Addon
```bash
# Enable KEDA on your AKS cluster
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-keda

# Verify KEDA operator is running
kubectl get pods -n kube-system -l app=keda-operator
# NAME                             READY   STATUS    RESTARTS
# keda-operator-7c8d6b5f4d-x9k2j   1/1     Running   0

# Check KEDA CRDs are installed
kubectl get crd | grep keda
# scaledobjects.keda.sh
# scaledjobs.keda.sh
# triggerauthentications.keda.sh
```
Create a PodDisruptionBudget
```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
```

```bash
kubectl apply -f pdb.yaml
kubectl get pdb -n production
# NAME      MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# api-pdb   2               N/A               1                     5s
```
🐛 Debugging Scenarios
Scenario 1: Cluster Autoscaler Not Adding Nodes
Symptom: Pods are stuck in Pending state but no new nodes appear even though autoscaler is enabled.
```bash
# Step 1: Confirm pods are Pending due to insufficient resources
kubectl get pods --field-selector=status.phase=Pending
kubectl describe pod <pending-pod>
# Look for: "0/5 nodes are available: 5 Insufficient cpu"

# Step 2: Verify autoscaler is enabled and not at max
az aks nodepool show \
  -g myResourceGroup --cluster-name myAKSCluster -n nodepool1 \
  --query "{autoScaling:enableAutoScaling, min:minCount, max:maxCount, current:count}"
# If current == max, increase --max-count

# Step 3: Check Azure VM quota in the region
az vm list-usage --location eastus -o table | grep "Standard DS"
# If "CurrentValue" equals "Limit" — request a quota increase

# Step 4: Check for cluster-autoscaler events
kubectl get events --field-selector reason=ScaleUp -A --sort-by='.lastTimestamp'
kubectl get events --field-selector reason=NotTriggerScaleUp -A

# Step 5: Check autoscaler status configmap
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
# Look for "ScaleUp" status, "backoff" entries, or "noScaleUp" reasons

# Step 6: If using Spot VMs, check if capacity is available
# Switch expander to allow fallback to regular pools
az aks update -g myResourceGroup -n myAKSCluster \
  --cluster-autoscaler-profile expander=priority
```

Scenario 2: HPA Not Scaling Up
Symptom: kubectl get hpa shows <unknown>/50% in the TARGETS column — HPA can't read metrics.
```bash
# Step 1: Check if metrics-server is running
kubectl get deployment metrics-server -n kube-system
kubectl top nodes   # Should show CPU/memory for each node
kubectl top pods    # Should show CPU/memory for each pod

# Step 2: If metrics-server is not running or unhealthy
kubectl get pods -n kube-system -l k8s-app=metrics-server
kubectl logs -n kube-system -l k8s-app=metrics-server --tail=50

# Step 3: Verify the deployment has resource requests set
kubectl get deployment api-server -o jsonpath='{.spec.template.spec.containers[0].resources}'
# If empty — HPA cannot calculate %. Add resources.requests:
kubectl set resources deployment api-server --requests=cpu=200m,memory=256Mi

# Step 4: Check HPA events for specific errors
kubectl describe hpa api-hpa
# Look for: "failed to get cpu utilization" or "missing request for cpu"

# Step 5: If using custom metrics, verify the adapter
kubectl get apiservice v1beta1.custom.metrics.k8s.io -o yaml
# If status shows "False" — the custom metrics adapter is broken

# Step 6: After fixing, watch HPA recover
kubectl get hpa api-hpa -w
```

Scenario 3: Scale-Down Too Aggressive — Pods Getting Evicted
Symptom: The autoscaler keeps removing nodes during business hours, causing brief disruptions as pods are rescheduled.
```bash
# Step 1: Check current autoscaler profile
az aks show -g myResourceGroup -n myAKSCluster \
  --query "autoScalerProfile" -o table

# Step 2: Increase scale-down cooldowns and LOWER the utilization threshold
# (a node is a scale-down candidate when utilization is BELOW the threshold,
#  so a lower value makes scale-down less aggressive, not more)
az aks update -g myResourceGroup -n myAKSCluster \
  --cluster-autoscaler-profile \
    scale-down-delay-after-add=15m \
    scale-down-unneeded-time=15m \
    scale-down-utilization-threshold=0.3

# Step 3: Ensure PodDisruptionBudgets are set for critical services
kubectl get pdb -A
# If missing, create PDBs with minAvailable for each stateful/critical service

# Step 4: Annotate pods that should NOT be evicted
kubectl annotate pod important-pod-xyz \
  "cluster-autoscaler.kubernetes.io/safe-to-evict"="false"

# Step 5: Use over-provisioning — deploy low-priority "pause" pods
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
EOF
# These pods hold capacity. When real pods need space, pause pods are
# evicted instantly — no waiting for new nodes to boot.
```

Over-provisioning with low-priority pause pods is the single most effective technique for reducing scale-up latency. Real pods preempt pause pods instantly, while the CA adds replacement capacity in the background.
🎯 Interview Questions
Beginner
**Q: What is the difference between manual scaling and autoscaling in AKS?**
Manual scaling requires a human to run az aks scale (nodes) or kubectl scale (pods) to change capacity. Autoscaling automatically adjusts capacity based on signals: HPA scales pods based on CPU/memory, Cluster Autoscaler adds/removes nodes based on pending pods, and KEDA scales based on external events. Autoscaling is essential for workloads with variable traffic.

**Q: How does the Cluster Autoscaler decide to add a node?**
The Cluster Autoscaler (CA) watches for pods in Pending state that can't be scheduled because no node has enough resources. When it finds such pods, it calculates which node pool can fit them and requests a new VM from Azure. The new node joins the cluster and the pending pod gets scheduled. CA checks this every scan-interval (default 10 seconds).

**Q: What metrics does HPA use, and how does it calculate the desired replica count?**
By default, HPA uses CPU and memory utilization from the metrics-server. You specify a target percentage (e.g., 70% CPU). HPA calculates: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). It also supports custom metrics (via Prometheus adapter) and external metrics (via KEDA or Azure Monitor adapter).

**Q: What is a PodDisruptionBudget, and why does it matter for autoscaling?**
A PodDisruptionBudget (PDB) tells Kubernetes the minimum number of replicas that must remain available during voluntary disruptions (like node drain during scale-down or upgrades). Without PDBs, the Cluster Autoscaler could evict all replicas of a service simultaneously, causing downtime. PDBs ensure graceful scale-down.

**Q: What is KEDA, and how does it differ from standard HPA?**
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA by adding support for external event sources as scaling triggers — such as Azure Service Bus queue depth, Storage Queue length, or cron schedules. Standard HPA only scales on pod CPU/memory or custom metrics from inside the cluster. KEDA can also scale to zero, which standard HPA cannot (minimum is 1 replica).
Intermediate
**Q: How do HPA and the Cluster Autoscaler work together?**
They operate at different layers and complement each other. HPA increases pod replicas when CPU/memory is high. If the cluster doesn't have enough node capacity for the new pods, they go Pending. The Cluster Autoscaler detects the Pending pods and adds nodes. When load drops, HPA reduces replicas, nodes become underutilized, and CA removes them. The chain is: metrics → HPA → more pods → Pending → CA → more nodes.

**Q: What are the Cluster Autoscaler expander strategies, and when would you use each?**
random: picks a random eligible node pool (default). most-pods: picks the pool that can schedule the most pending pods. least-waste: picks the pool with the least idle CPU/memory after scheduling. priority: uses a ConfigMap to define pool priority order — ideal for preferring Spot pools with fallback to on-demand. The priority expander is most commonly used in production.

**Q: Why shouldn't HPA and VPA scale on the same metric?**
HPA and VPA both react to the same signals but take opposite actions. If both target CPU: HPA adds pods to reduce per-pod CPU utilization, while VPA increases resource requests per pod. VPA's increase makes HPA think utilization dropped (larger denominator), so HPA scales down. This creates oscillation. Best practice: use HPA for CPU-based scaling, VPA in "Off" mode for recommendation only, or VPA only for memory.

**Q: What does --max-surge do on a node pool?**
--max-surge controls how many extra nodes are created during a rolling upgrade. For example, --max-surge 1 adds 1 extra node, cordons an old node, drains its pods to the new one, then deletes the old node. Higher values speed up upgrades but consume more Azure quota temporarily. Use --max-surge 33% for a balance of speed and resource usage. For Spot pools, surge nodes may not be available, so the upgrade may stall.

**Q: How does over-provisioning with pause pods reduce scale-up latency?**
Over-provisioning deploys low-priority "pause" pods (using registry.k8s.io/pause) that reserve node capacity. These pods have a PriorityClass with a very low value (e.g., -1). When real workload pods need scheduling, Kubernetes preempts the pause pods instantly — no VM boot delay. The CA then backfills by adding new nodes for the evicted pause pods. This turns a 3-5 minute scale-up wait into near-instant scheduling.
Scenario-Based
**Q: You expect a 10x traffic spike for a flash sale. How do you prepare the cluster?**
1) Deploy over-provisioning pause pods that reserve 10-15 nodes of headroom — real pods preempt them instantly. 2) Use KEDA with a cron trigger to pre-scale the deployment to 50% capacity 15 minutes before the sale. 3) Set aggressive HPA scaleUp behavior: stabilizationWindowSeconds: 0 with 100% scale-up policy. 4) Combine Spot and on-demand node pools with priority expander. 5) Pre-warm the cluster by bumping --min-count on the node pool before the event.

**Q: The Cluster Autoscaler is not removing an underutilized node. What could be the reasons?**
Possible causes: 1) scale-down-unneeded-time hasn't elapsed yet (default 10m, but may be set higher). 2) Pods on those nodes have safe-to-evict: false annotation. 3) Pods use local storage (emptyDir) and skip-nodes-with-local-storage is true. 4) A PodDisruptionBudget prevents eviction (PDB allows 0 disruptions). 5) Pods belong to DaemonSets (these don't block scale-down but increase utilization). Check with: kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml — it shows why each node is not eligible for removal.

**Q: How would you run batch workloads cost-effectively on Spot node pools?**
1) Use the priority expander with Spot as primary and a regular pool as fallback. 2) Set tolerations on batch pods for the Spot taint (kubernetes.azure.com/scalesetpriority: spot). 3) Use multiple Spot pools with different VM sizes to increase capacity availability. 4) Design batch jobs to be idempotent and checkpoint-able so eviction doesn't lose progress. 5) Set terminationGracePeriodSeconds: 25 (Spot gets 30s notice) to save state. 6) Use KEDA ScaledJobs for batch processing — KEDA creates new Job pods to replace evicted ones.

**Q: HPA shows high CPU but replicas are not increasing. How do you troubleshoot?**
1) Check if current replicas already equal maxReplicas: kubectl get hpa. 2) Check the scaleUp.stabilizationWindowSeconds — HPA waits this long before scaling to avoid flapping. 3) Check if the deployment has a spec.replicas that's been set manually — this can conflict with HPA. 4) Verify HPA behavior policies — a restrictive scaleUp.policies might limit how fast scaling happens. 5) Check HPA events: kubectl describe hpa for "FailedGetScale" or "ScalingLimited" events.

**Q: Design autoscaling for a queue-driven worker that is idle at night and busy during business hours.**
Use KEDA with two triggers: 1) azure-servicebus scaler targeting the queue, with messageCount: 10 (1 pod per 10 messages), minReplicaCount: 0 (scale to zero at night), maxReplicaCount: 50. 2) A cron trigger to pre-warm to 5 replicas at 7:45 AM before business hours start. Set cooldownPeriod: 300 to avoid premature scale-down between message bursts. Use workload identity for KEDA to authenticate to Service Bus. Pair with Cluster Autoscaler to handle the node-level scaling for up to 50 pods.
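That design can be sketched as a single ScaledObject with both triggers (KEDA takes the maximum across triggers). The deployment name, queue name, schedule, and the referenced TriggerAuthentication are assumptions for illustration:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invoice-worker
spec:
  scaleTargetRef:
    name: invoice-worker
  minReplicaCount: 0          # scale to zero overnight
  maxReplicaCount: 50
  cooldownPeriod: 300         # ride out gaps between message bursts
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: invoices
        messageCount: "10"    # 1 pod per 10 messages
      authenticationRef:
        name: servicebus-workload-identity   # hypothetical TriggerAuthentication
    - type: cron
      metadata:
        timezone: America/New_York
        start: 45 7 * * 1-5   # pre-warm at 7:45 AM on weekdays
        end: 0 18 * * 1-5
        desiredReplicas: "5"
```

During the cron window the worker never drops below 5 replicas; outside it, the queue scaler alone drives the count, down to zero when the queue is empty.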
🌍 Real-World Use Case
E-Commerce Flash Sale — KEDA + Cluster Autoscaler
An online retailer handles 500 RPS normally but sees 10x traffic spikes during flash sales. Their scaling architecture:
- Baseline: 5 nodes (Standard_D4s_v5, 4 vCPU / 16 GB each), web pods min 6 replicas via HPA.
- Pre-sale: KEDA cron trigger scales web pods to 30 replicas at T-15 minutes. Cluster Autoscaler adds nodes to fit them.
- During sale: HPA targets 60% CPU with aggressive scale-up (100% increase per 30s). KEDA scales order-processor pods based on Service Bus queue depth (1 pod per 5 messages, max 100).
- Node pools: Primary on-demand pool (min 5, max 20) + Spot pool (min 0, max 30) with priority expander preferring Spot first.
- Over-provisioning: 3 pause pods reserving 3 nodes of headroom — instant scheduling for the first wave of scale-up.
- PDBs: web pods `minAvailable: 4`, checkout pods `minAvailable: 2` — ensures availability during autoscaler scale-down.
- Post-sale: HPA/KEDA scale down over 10 minutes (stabilization window). CA removes idle nodes after 15 minutes. Spot pool drains to 0.
- Result: Zero downtime during 10x traffic spikes, 60% cost reduction with Spot nodes, 95th percentile latency under 200ms.
📝 Summary
- Manual scaling (`az aks scale`, `kubectl scale`) is good for planned changes; autoscaling handles the unpredictable.
- Cluster Autoscaler adds/removes nodes based on pending pods — configure min/max counts and profile settings carefully.
- HPA scales pod replicas on CPU/memory — requires resource requests to be set on pods.
- VPA right-sizes pod resource requests — use in "Off" mode alongside HPA to avoid conflicts.
- KEDA scales on external events (queues, cron, custom metrics) and can scale to zero — ideal for event-driven workloads.
- Over-provisioning with low-priority pause pods eliminates VM boot latency — schedule real pods instantly.
- PodDisruptionBudgets are mandatory for production — they prevent the autoscaler from causing downtime during scale-down.
- Combine HPA + CA + KEDA + Spot pools + PDBs for a cost-efficient, resilient autoscaling strategy.