Advanced Lesson 9 of 14

Scaling

Handle traffic spikes and optimize resource usage. HPA, VPA, Cluster Autoscaler, and scaling strategies.

🧒 Simple Explanation (ELI5)

Imagine a checkout counter at a store. During a sale, you open more counters (scale out). When it's quiet, you close extras (scale in). HPA is like the store manager who watches the queue length and opens/closes counters automatically. Cluster Autoscaler is like the building manager who adds more rooms when all existing counters are full.

🔧 Technical Explanation

Three Levels of Scaling

Level                            | What Scales                    | Tool
Horizontal Pod Autoscaler (HPA)  | Number of pod replicas         | kubectl autoscale / HPA resource
Vertical Pod Autoscaler (VPA)    | CPU/memory requests per pod    | VPA resource (add-on)
Cluster Autoscaler               | Number of nodes in the cluster | Cloud provider integration

📊 Visual: Scaling Architecture

Scaling Layers

High CPU/memory      → HPA adds more pods              (Pods: 3 → 8)
No room for new pods → Cluster Autoscaler adds nodes   (Nodes: 3 → 5)
Pod needs more CPU   → VPA raises pod resources        (CPU: 100m → 500m)

⌨️ Hands-on: Horizontal Pod Autoscaler

💡 Prerequisite

HPA requires metrics-server installed in the cluster: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Create a Deployment with resource requests

yaml
# deployment-for-hpa.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "100m"
              memory: "64Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"

Create HPA

bash
# Imperative
kubectl autoscale deployment web-app --cpu-percent=50 --min=2 --max=10

# Verify
kubectl get hpa
kubectl describe hpa web-app

Declarative HPA with custom metrics

yaml
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Simulate Load and Watch Scaling

bash
# Generate load (from another terminal).
# Assumes a Service named web-app-service exposes the web-app Deployment,
# e.g.: kubectl expose deployment web-app --name=web-app-service --port=80
kubectl run load-gen --image=busybox --rm -it -- sh -c \
  "while true; do wget -q -O- http://web-app-service; done"

# Watch HPA respond
kubectl get hpa -w

# Watch pods scale
kubectl get pods -w

Cluster Autoscaler (AKS example)

bash
# Enable cluster autoscaler on AKS node pool
az aks nodepool update \
  --resource-group myRG \
  --cluster-name myCluster \
  --name nodepool1 \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 10

# When pods are Pending due to insufficient node capacity,
# the autoscaler adds nodes. When nodes are underutilized,
# it drains and removes them.

🐛 Debugging Scenarios

Scenario 1: High Traffic — HPA Not Scaling

bash
# Check HPA status
kubectl describe hpa web-app-hpa

# Common issues:
# - "unable to fetch metrics" → metrics-server not installed or not working
# - "missing request for cpu" → pods don't have resource requests set
# - Already at maxReplicas → increase max
# - Stabilization window preventing scale-up → check behavior config

# Verify metrics-server
kubectl top pods
kubectl top nodes

Scenario 2: Pods Pending After Scale-up — No Node Capacity

Cause: HPA added pods but nodes are full. Cluster autoscaler is either not enabled or too slow.

Scenario 3: Scaling Too Aggressively — Flapping

Symptom: HPA keeps scaling up and down rapidly (thrashing).

Fix: Increase stabilizationWindowSeconds for scaleDown (e.g., 300-600s). Adjust CPU target (60% instead of 50%). Add scaleDown policies to limit the rate.

🎯 Interview Questions

Beginner

Q: What is Horizontal Pod Autoscaling (HPA)?

HPA automatically scales the number of pod replicas based on observed metrics (CPU, memory, or custom metrics). When CPU usage exceeds the target (e.g., 50%), HPA adds more pods. When it drops, pods are removed. It requires metrics-server to be installed and resource requests to be set on pods.

Q: What is the difference between HPA and VPA?

HPA scales horizontally — adds/removes pod replicas. VPA scales vertically — adjusts CPU/memory requests for individual pods. HPA is for stateless apps that can run multiple instances. VPA is for apps that can't easily scale horizontally (single-instance databases, batch jobs).

Q: What is the Cluster Autoscaler?

The Cluster Autoscaler adjusts the number of nodes in a cluster. It adds nodes when pods can't be scheduled (Pending due to insufficient resources). It removes underutilized nodes when pods can be consolidated elsewhere. Works with cloud provider APIs (AKS, EKS, GKE).

Q: What is the prerequisite for HPA to work?

Two things: 1) metrics-server must be installed in the cluster (provides CPU/memory metrics). 2) Pods must have resource requests defined. Without requests, HPA can't calculate utilization percentage (it divides current usage by requested amount).

Q: How do you manually scale a Deployment?

kubectl scale deployment/my-app --replicas=5 or edit the deployment YAML and kubectl apply. Note: if an HPA is active, it will override manual scaling to match its own target. To take manual control, delete the HPA first (or pin the range by setting minReplicas and maxReplicas to the desired count).

Intermediate

Q: How does HPA calculate the desired replica count?

Formula: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric)). Example: 3 replicas at 80% CPU with 50% target → 3 × (80/50) = 4.8 → 5 replicas. HPA evaluates every 15 seconds by default. Multiple metrics: HPA calculates for each and takes the maximum.

Q: What is the stabilization window in HPA?

The stabilization window prevents rapid scaling decisions. For scale-down, HPA uses the highest replica recommendation computed over the window, so it only removes pods if the lower recommendation persists for the entire window. Default: 300s for scale-down, 0s for scale-up. Configurable in the HPA behavior spec. Prevents "flapping" (rapid scale up/down cycles).

Q: Can HPA and VPA be used together?

Generally, don't use both on the same metric (CPU) for the same deployment — they'll conflict. However: you can use HPA on CPU and VPA on memory, or use VPA in "recommend-only" mode alongside HPA. The Multidimensional Pod Autoscaler (MPA) is a newer approach that combines both intelligently.
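As a sketch of the recommend-only pattern, a VPA can be pointed at the same Deployment with updates disabled, so it publishes recommendations without ever evicting pods (assumes the VPA add-on is installed; names reuse the lesson's web-app example):

```yaml
# vpa-recommend.yaml — VPA in recommend-only mode (illustrative sketch)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  updatePolicy:
    updateMode: "Off"   # record recommendations only, never restart pods
```

Recommendations then appear under kubectl describe vpa web-app-vpa, which you can compare against the requests HPA is scaling on.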

Q: What are custom metrics for HPA?

Beyond CPU/memory, HPA can scale on custom metrics (requests-per-second, queue depth, latency) via the custom metrics API. You need a metrics adapter (Prometheus Adapter, Datadog adapter) that translates application metrics into the K8s metrics API. Example: scale based on messages in a Kafka topic or pending jobs in a queue.
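A custom-metric HPA might look like the sketch below, assuming a metrics adapter (e.g. Prometheus Adapter) already exposes a per-pod http_requests_per_second metric through the custom metrics API — the metric name and target value are illustrative:

```yaml
# hpa-custom-metric.yaml — scale on a per-pod custom metric (sketch)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-rps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # assumed adapter-provided metric
        target:
          type: AverageValue
          averageValue: "100"              # target ~100 req/s per pod
```

With type: Pods, HPA averages the metric across pods and scales to keep that average near the target value.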

Q: How does the Cluster Autoscaler decide to remove a node?

A node is considered for removal if: 1) All pods can be moved to other nodes. 2) No pod has a PodDisruptionBudget that would be violated. 3) No pod is pinned to that node (nodeSelector, affinity). 4) The node has been underutilized (typically below 50%) for a sustained period (default 10 min). DaemonSet pods and mirror pods are ignored; pods without a controller block node removal by default.
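The PodDisruptionBudget condition above can be expressed as a small manifest — this sketch (reusing the lesson's web-app labels) guarantees the autoscaler never drains below two running replicas:

```yaml
# pdb.yaml — keep at least 2 web-app pods up during node drains (sketch)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
```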

Scenario-Based

Q: Black Friday is coming. How do you prepare your Kubernetes cluster for 10x traffic?

1) Pre-scale: Increase minimum replicas before the event. Don't rely on autoscaling alone for predictable spikes. 2) Pre-provision nodes: Scale up the cluster autoscaler min count. New nodes take minutes to provision. 3) Test: Load test with realistic traffic patterns. 4) HPA tuning: Set aggressive scale-up policy (fast reaction). 5) Resource limits: Ensure pods have correct requests so scheduling is efficient. 6) PDB: Set Pod Disruption Budgets to maintain availability during node changes.

Q: HPA is configured at 50% CPU target but pods always show 48-52% and never scale. Is this correct?

Yes — HPA is working correctly. It's maintaining pods right around the target. HPA has a tolerance window (default ±10%) to avoid unnecessary scaling for small fluctuations. At 48-52% with a 50% target, the system is in steady state. This is healthy behavior. It would scale up if CPU consistently exceeds ~55% and scale down if below ~45%.

Q: Your app is memory-intensive and keeps getting OOMKilled even after scaling. What's wrong?

HPA adds more pods but each pod's memory limit is too low. Adding replicas doesn't help if each replica can't handle its share of work within its memory limit. Fix: 1) Increase memory limits per pod. 2) Use VPA to automatically right-size resource requests/limits. 3) Profile the app for memory leaks. 4) If it's a single process that needs more memory, VPA is the right tool — not HPA.

Q: Cluster autoscaler added a node but it took 5 minutes and some requests timed out. How do you improve this?

1) Over-provision: Use a low-priority "pause" pod that occupies space. When real pods need room, the pause pod is evicted instantly — no need to wait for new nodes. 2) Pre-scale nodes before expected traffic. 3) Use spot/preemptible instance pools for faster provisioning. 4) Keep warm nodes: Set cluster autoscaler min higher. 5) Node auto-provisioning (NAP): Some providers create optimally-sized node pools on demand.
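The over-provisioning trick in point 1 can be sketched as a negative-priority "pause" Deployment: it reserves headroom on nodes, and because its priority is lower than any real workload, the scheduler evicts it instantly when capacity is needed (replica count and resource sizes here are illustrative):

```yaml
# overprovision.yaml — placeholder pods that reserve node headroom (sketch)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10              # lower than any real workload, so these evict first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing, just holds space
          resources:
            requests:
              cpu: "500m"      # headroom reserved per placeholder pod
              memory: "512Mi"
```

Evicting the pause pods re-creates them as Pending, which in turn triggers the cluster autoscaler to restore the headroom in the background.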

Q: You want to scale based on messages in an SQS queue. How?

1) Deploy a metrics adapter like KEDA (Kubernetes Event-Driven Autoscaler) or Prometheus Adapter. 2) KEDA natively supports SQS as a trigger — configure a ScaledObject targeting your deployment with the SQS queue URL and threshold. 3) KEDA can scale to zero (unlike HPA) and creates an HPA under the hood. 4) The scaling is based on queue depth: more messages → more pods to process them.
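A KEDA ScaledObject for the SQS case might look like this sketch — the deployment name, queue URL, region, and threshold are all placeholders, and AWS credentials would normally be wired up via a TriggerAuthentication or pod identity:

```yaml
# keda-sqs.yaml — scale a worker Deployment on SQS queue depth (sketch)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaler
spec:
  scaleTargetRef:
    name: sqs-worker          # Deployment to scale (placeholder name)
  minReplicaCount: 0          # KEDA can scale to zero, unlike plain HPA
  maxReplicaCount: 30
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs  # placeholder
        queueLength: "5"      # target messages per replica
        awsRegion: us-east-1
```

Under the hood KEDA creates and manages an HPA for the 1..N range and handles the 0↔1 transition itself.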

🌍 Real-World Use Case

An online gaming platform handles 50K concurrent users normally, peaking to 500K during tournaments:

📝 Summary
