Scaling

Handle traffic spikes and optimize resource usage. HPA, VPA, Cluster Autoscaler, and scaling strategies.

Prerequisite: HPA requires metrics-server installed in the cluster:

```bash
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
🧒 Simple Explanation (ELI5)
Imagine a checkout counter at a store. During a sale, you open more counters (scale out). When it's quiet, you close extras (scale in). HPA is like the store manager who watches the queue length and opens/closes counters automatically. Cluster Autoscaler is like the building manager who adds more rooms when all existing counters are full.
🔧 Technical Explanation
Three Levels of Scaling
| Level | What Scales | Tool |
|---|---|---|
| Horizontal Pod Autoscaler (HPA) | Number of pod replicas | kubectl autoscale / HPA resource |
| Vertical Pod Autoscaler (VPA) | CPU/memory requests per pod | VPA resource (addon) |
| Cluster Autoscaler | Number of nodes in the cluster | Cloud provider integration |
📊 Visual: Scaling Architecture

```
Traffic spike
      │
      ▼
    HPA ──────────────▶ more pod replicas (scale out)
      │  pods Pending: no node capacity left
      ▼
Cluster Autoscaler ───▶ more nodes added to the cluster

    VPA ──────────────▶ larger CPU/memory requests per pod (scale up)
```
⌨️ Hands-on: Horizontal Pod Autoscaler
Create a Deployment with resource requests
```yaml
# deployment-for-hpa.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: nginx:1.25
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: "100m"
            memory: "64Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
```
Create HPA
```bash
# Imperative
kubectl autoscale deployment web-app --cpu-percent=50 --min=2 --max=10

# Verify
kubectl get hpa
kubectl describe hpa web-app
```
Declarative HPA with multiple metrics and scaling behavior
```yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
Simulate Load and Watch Scaling
```bash
# Generate load (from another terminal)
kubectl run load-gen --image=busybox --rm -it -- sh -c \
  "while true; do wget -q -O- http://web-app-service; done"

# Watch HPA respond
kubectl get hpa -w

# Watch pods scale
kubectl get pods -w
```
Cluster Autoscaler (AKS example)
```bash
# Enable cluster autoscaler on AKS node pool
az aks nodepool update \
  --resource-group myRG \
  --cluster-name myCluster \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10

# When pods are Pending due to insufficient node capacity,
# the autoscaler adds nodes. When nodes are underutilized,
# it drains and removes them.
```
🐛 Debugging Scenarios
Scenario 1: High Traffic — HPA Not Scaling
```bash
# Check HPA status
kubectl describe hpa web-app-hpa

# Common issues:
# - "unable to fetch metrics" → metrics-server not installed or not working
# - "missing request for cpu" → pods don't have resource requests set
# - Already at maxReplicas → increase max
# - Stabilization window preventing scale-up → check behavior config

# Verify metrics-server
kubectl top pods
kubectl top nodes
```
Scenario 2: Pods Pending After Scale-up — No Node Capacity
Cause: HPA added pods but nodes are full. Cluster autoscaler is either not enabled or too slow.
- Check scheduling events: `kubectl get events | grep FailedScheduling`
- Check node capacity: `kubectl describe nodes | grep -A 5 "Allocated resources"`
- Enable the cluster autoscaler if it is not running
- Check autoscaler logs: `kubectl logs -n kube-system -l app=cluster-autoscaler`
Scenario 3: Scaling Too Aggressively — Flapping
Symptom: HPA keeps scaling up and down rapidly (thrashing).
Fix:
- Increase `stabilizationWindowSeconds` for scaleDown (e.g., 300-600s)
- Raise the CPU target (e.g., 60% instead of 50%) so small fluctuations stay within tolerance
- Add scaleDown policies to limit the rate of replica removal
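A sketch of that damping in the HPA `behavior` spec (the values are illustrative, not prescriptive):

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600   # require 10 min of consistently low load
    policies:
    - type: Percent
      value: 10                       # remove at most 10% of replicas per minute
      periodSeconds: 60
    - type: Pods
      value: 2                        # and at most 2 pods per minute
      periodSeconds: 60
    selectPolicy: Min                 # apply whichever policy is more conservative
```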
🎯 Interview Questions
Beginner
Q: What is the Horizontal Pod Autoscaler and how does it work?
HPA automatically scales the number of pod replicas based on observed metrics (CPU, memory, or custom metrics). When CPU usage exceeds the target (e.g., 50%), HPA adds more pods. When it drops, pods are removed. It requires metrics-server to be installed and resource requests to be set on pods.
Q: What is the difference between HPA and VPA?
HPA scales horizontally: it adds/removes pod replicas. VPA scales vertically: it adjusts CPU/memory requests for individual pods. HPA is for stateless apps that can run multiple instances. VPA is for apps that can't easily scale horizontally (single-instance databases, batch jobs).
Q: What does the Cluster Autoscaler do?
The Cluster Autoscaler adjusts the number of nodes in a cluster. It adds nodes when pods can't be scheduled (Pending due to insufficient resources). It removes underutilized nodes when pods can be consolidated elsewhere. It works through cloud provider APIs (AKS, EKS, GKE).
Q: What are the prerequisites for HPA to work?
Two things: 1) metrics-server must be installed in the cluster (it provides CPU/memory metrics). 2) Pods must have resource requests defined. Without requests, HPA can't calculate utilization percentage (it divides current usage by the requested amount).
Q: How do you manually scale a Deployment?
`kubectl scale deployment/my-app --replicas=5`, or edit the deployment YAML and `kubectl apply`. Note: if an HPA is active, it will override manual scaling to match its target. To manually override, pause or delete the HPA first.
Intermediate
Q: How does HPA calculate the desired replica count?
Formula: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric)). Example: 3 replicas at 80% CPU with a 50% target → 3 × (80/50) = 4.8 → 5 replicas. HPA evaluates every 15 seconds by default. With multiple metrics, HPA calculates a recommendation for each and takes the maximum.
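The formula can be checked in a few lines. This is a sketch of the core calculation only; the real controller also applies a tolerance band and stabilization windows before acting:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """HPA core formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Example from the text: 3 replicas at 80% CPU with a 50% target
print(desired_replicas(3, 80, 50))  # 5

# Multiple metrics: one recommendation per metric, HPA takes the maximum
cpu_rec = desired_replicas(3, 80, 50)   # CPU says 5
mem_rec = desired_replicas(3, 60, 70)   # memory says 3
print(max(cpu_rec, mem_rec))            # 5
```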
Q: What is the stabilization window in HPA?
The stabilization window prevents rapid scaling decisions. For scale-down, HPA considers all recommendations over the window and acts on the highest one, so replicas only drop once every recommendation in the window agrees a lower count is safe. Defaults: 300s for scale-down, 0s for scale-up. It is configurable in the HPA behavior spec and prevents "flapping" (rapid scale up/down cycles).
Q: Can you use HPA and VPA together?
Generally, don't use both on the same metric (CPU) for the same deployment — they'll conflict. However, you can use HPA on CPU and VPA on memory, or run VPA in "recommend-only" mode alongside HPA. The Multidimensional Pod Autoscaler (MPA) is a newer approach that combines both intelligently.
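A sketch of VPA in recommend-only mode next to an HPA, assuming the VPA addon is installed (the target name `web-app` matches the deployment above):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"   # recommend only; never evicts pods to resize them
```

Read the suggested requests with `kubectl describe vpa web-app-vpa` and apply them manually if they look right.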
Q: How can HPA scale on custom metrics?
Beyond CPU/memory, HPA can scale on custom metrics (requests per second, queue depth, latency) via the custom metrics API. You need a metrics adapter (Prometheus Adapter, Datadog adapter) that translates application metrics into the K8s metrics API. Example: scale based on messages in a Kafka topic or pending jobs in a queue.
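With the Prometheus Adapter exposing a per-pod requests-per-second metric (the metric name here is hypothetical), the HPA `metrics` section might look like:

```yaml
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second   # hypothetical metric exposed by the adapter
    target:
      type: AverageValue
      averageValue: "100"              # scale so each pod handles ~100 req/s
```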
Q: When does the Cluster Autoscaler remove a node?
A node is considered for removal if: 1) All pods can be moved to other nodes. 2) No pod has a PodDisruptionBudget that would be violated. 3) No pod is restricted to that node (nodeSelector, affinity). 4) The node has been underutilized (below 50% by default) for a sustained period (default 10 min). DaemonSet and mirror pods don't block removal, but pods without a controller or with local storage do block scale-down by default.
Scenario-Based
Q: How would you prepare a cluster for a predictable traffic spike, such as a product launch?
1) Pre-scale: increase minimum replicas before the event; don't rely on autoscaling alone for predictable spikes.
2) Pre-provision nodes: raise the cluster autoscaler min count, since new nodes take minutes to provision.
3) Test: load test with realistic traffic patterns.
4) HPA tuning: set an aggressive scale-up policy for fast reaction.
5) Resource requests: ensure pods have correct requests so scheduling is efficient.
6) PDB: set Pod Disruption Budgets to maintain availability during node changes.
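The Pod Disruption Budget from the last point can be sketched minimally (the `app: web-app` label matches the deployment above):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2        # keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: web-app
```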
Q: Your HPA target is 50% CPU and pods hover at 48-52% with no scaling activity. Is something wrong?
No — HPA is working correctly. It's maintaining pods right around the target. HPA has a tolerance window (default ±10%) to avoid unnecessary scaling for small fluctuations. At 48-52% with a 50% target, the system is in steady state. This is healthy behavior. It would scale up if CPU consistently exceeds ~55% and scale down if it falls below ~45%.
Q: HPA keeps adding replicas, but pods are still being OOMKilled. What's going on?
HPA adds more pods, but each pod's memory limit is too low. Adding replicas doesn't help if each replica can't handle its share of the work within its memory limit. Fix: 1) Increase memory limits per pod. 2) Use VPA to automatically right-size resource requests/limits. 3) Profile the app for memory leaks. 4) If it's a single process that needs more memory, VPA is the right tool, not HPA.
Q: New nodes take minutes to provision, making scale-up slow. How do you reduce that latency?
1) Over-provision: run a low-priority "pause" pod that reserves space. When real pods need room, the pause pod is evicted instantly, with no wait for new nodes.
2) Pre-scale nodes before expected traffic.
3) Use spot/preemptible instance pools for faster provisioning.
4) Keep warm nodes: set the cluster autoscaler minimum higher.
5) Node auto-provisioning (NAP): some providers create optimally sized node pools on demand.
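The over-provisioning trick from point 1 can be sketched as a negative-priority placeholder deployment; the names, replica count, and resource sizes here are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                 # lower than any real workload, so these pods evict first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9   # does nothing; just reserves capacity
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
```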
Q: How would you autoscale workers based on the depth of an AWS SQS queue?
1) Deploy a metrics adapter such as KEDA (Kubernetes Event-Driven Autoscaler) or the Prometheus Adapter.
2) KEDA natively supports SQS as a trigger: configure a ScaledObject targeting your deployment with the SQS queue URL and a threshold.
3) KEDA can scale to zero (unlike HPA) and creates an HPA under the hood.
4) Scaling is based on queue depth: more messages, more pods to process them.
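A minimal KEDA ScaledObject for this, assuming KEDA is installed with AWS credentials configured; the deployment name and queue URL are illustrative:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
spec:
  scaleTargetRef:
    name: worker                     # hypothetical Deployment of queue consumers
  minReplicaCount: 0                 # KEDA can scale all the way to zero
  maxReplicaCount: 30
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs   # illustrative
      queueLength: "5"               # target ~5 messages per replica
      awsRegion: us-east-1
```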
🌍 Real-World Use Case
An online gaming platform handles 50K concurrent users normally, peaking to 500K during tournaments:
- HPA: Game servers scale 5→50 replicas based on custom metric (active connections per pod)
- Cluster Autoscaler: Node pool scales 10→100 nodes (using spot instances for cost)
- Pre-scaling: 1 hour before known tournaments, minimum replicas are bumped to 30
- KEDA: Backend processing workers scale on message queue depth
- Scale-down: Conservative 10-minute stabilization window prevents premature scale-down between tournament rounds
📝 Summary
- HPA scales pod replicas based on CPU/memory/custom metrics
- VPA adjusts resource requests per pod — use for single-instance workloads
- Cluster Autoscaler adds/removes nodes when capacity is needed
- HPA requires metrics-server and resource requests on pods
- For predictable traffic spikes, pre-scale — don't rely solely on autoscaling
- Use KEDA for event-driven scaling (queues, cron, external metrics)