Basics Lesson 4 of 14

Node Pools

System vs user node pools, VM size selection, spot instances for massive savings, taints, labels, and multi-pool strategies for production workloads.

🧒 Simple Explanation (ELI5)

Think of node pools like different teams in a company:

You wouldn't put the IT team and the factory workers in the same room. Similarly, you separate Kubernetes system components from your application workloads into different node pools.

🔧 Technical Explanation

System vs User Node Pools

Every AKS cluster requires at least one system node pool. System node pools run critical Kubernetes components.

| Aspect | System Node Pool | User Node Pool |
| --- | --- | --- |
| Purpose | Runs Kubernetes system pods (CoreDNS, metrics-server, kube-proxy, CSI drivers) | Runs your application workloads |
| Required? | Yes — at least one system pool must exist | Optional (but recommended for production) |
| Default taint | `CriticalAddonsOnly=true:NoSchedule` (when dedicated system pool) | None |
| Minimum nodes | 1 (dev) or 3 (production with zones) | 0 (can scale to zero) |
| Scale to zero? | No — cluster breaks without system pods | Yes — great for cost savings |
| VM size recommendation | Standard_D2s_v5 (2 vCPU, 8 GB) | Depends on workload |
CriticalAddonsOnly Taint

When you mark a node pool as mode=System, AKS applies the taint CriticalAddonsOnly=true:NoSchedule. This prevents your application pods from scheduling on system nodes — protecting system components from resource contention. Only pods with a matching toleration (which AKS system pods have) can schedule there. This separation is a best practice for production.
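For illustration, a cluster-critical add-on you deploy yourself would need a toleration like this to land on a dedicated system pool (a sketch; most application pods should not carry it):

```yaml
# Toleration matching the CriticalAddonsOnly=true:NoSchedule taint
# that AKS applies to dedicated system pools
tolerations:
- key: "CriticalAddonsOnly"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
```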

VM Sizes for Different Workloads

| Workload Type | Recommended VM Series | Example SKU | Specs | ~Monthly Cost |
| --- | --- | --- | --- | --- |
| System pool | Dsv5 (general purpose) | Standard_D2s_v5 | 2 vCPU, 8 GB RAM | ~$70 |
| Web APIs / microservices | Dsv5 (general purpose) | Standard_D4s_v5 | 4 vCPU, 16 GB RAM | ~$140 |
| Memory-intensive (caching, search) | Esv5 (memory optimized) | Standard_E4s_v5 | 4 vCPU, 32 GB RAM | ~$185 |
| CPU-intensive (batch, encoding) | Fsv2 (compute optimized) | Standard_F8s_v2 | 8 vCPU, 16 GB RAM | ~$245 |
| AI/ML training | NC-series (GPU) | Standard_NC6s_v3 | 6 vCPU, 112 GB RAM, 1× V100 | ~$2,200 |
| Dev/test | Bs-series (burstable) | Standard_B2s | 2 vCPU, 4 GB RAM | ~$30 |
| Windows workloads | Dsv5 (general purpose) | Standard_D4s_v5 | 4 vCPU, 16 GB RAM | ~$180 (incl. Windows license) |

Spot Node Pools

Spot instances use Azure's spare compute capacity at up to a 90% discount. The catch: Azure can evict your nodes with 30 seconds' notice when it needs the capacity back.

| Setting | Description |
| --- | --- |
| `--priority Spot` | Creates a spot node pool (discounted VMs) |
| `--eviction-policy Delete` | Evicted VMs are deleted (recommended). Alternative: `Deallocate` (preserves the OS disk) |
| `--spot-max-price -1` | Pay the current spot price, capped at the on-demand price, so the VM is never evicted for price reasons (recommended). Or set a cap, e.g. `--spot-max-price 0.05` |
⚠️ Spot Pools Are Not For Everything

Only run workloads on spot nodes that can tolerate interruption: batch jobs, CI/CD runners, stateless workers, dev/test, data processing. Never run your production API or database on spot nodes. AKS applies a kubernetes.azure.com/scalesetpriority:spot taint — your pods need a matching toleration.

Node Pool Scaling Options

| Method | How | Use Case |
| --- | --- | --- |
| Manual scaling | `az aks nodepool scale --node-count 5` | Known capacity needs, planned events |
| Cluster autoscaler | `--enable-cluster-autoscaler --min-count 2 --max-count 10` | Variable traffic, auto-adjust to demand |
| Scale to zero | `--min-count 0` (user pools only) | GPU/spot pools that aren't always needed |

Taints and Labels

Taints and labels on node pools control which pods schedule where. Labels attract pods (via nodeSelector or node affinity); taints repel pods unless they carry a matching toleration.
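For example (pool names follow the examples later in this lesson; a sketch, not the only pattern), a pod spec fragment combining both mechanisms:

```yaml
# Labels steer pods in; taints keep pods out
spec:
  nodeSelector:
    agentpool: apppool      # label match: only nodes in the app pool qualify
  # No toleration for kubernetes.azure.com/scalesetpriority=spot,
  # so any tainted spot nodes automatically repel this pod.
```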

OS SKU Options

| OS SKU | Description | When to Use |
| --- | --- | --- |
| Ubuntu | Default Linux OS for AKS nodes. Battle-tested, broad compatibility. | Default choice for most workloads |
| AzureLinux | Microsoft's Linux distro (formerly CBL-Mariner). Smaller image, faster boot, more secure. | Performance-sensitive or security-hardened clusters |
| Windows2022 | Windows Server node pool. Runs Windows containers. | .NET Framework apps, Windows-only workloads |

Max Pods Per Node

The --max-pods setting determines how many pods a single node can run. With Azure CNI the default is 30 (each pod pre-allocates a VNet IP from the node's subnet); with kubenet the default is 110. The value is fixed at pool creation; to change it, you must create a replacement pool.
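Since max-pods also drives Azure CNI subnet sizing (each node reserves max-pods + 1 IPs up front), a quick back-of-the-envelope check helps; the numbers below are illustrative:

```bash
# Azure CNI IP math: each node pre-allocates (max-pods + 1) IPs from the subnet
NODES=10        # planned maximum node count (autoscaler --max-count)
MAX_PODS=110    # --max-pods setting
IPS_NEEDED=$((NODES * (MAX_PODS + 1)))
echo "IPs needed: $IPS_NEEDED"      # prints: IPs needed: 1110
```

Add headroom for surge nodes during upgrades before picking the subnet prefix.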

📊 Multi-Pool Architecture

Production Node Pool Strategy:

- **System Pool** (Standard_D2s_v5): CoreDNS, metrics-server, kube-proxy, CSI drivers. Taint: `CriticalAddonsOnly`
- **App Pool** (Standard_D4s_v5): API pods, web frontend pods, worker pods. Autoscaler: 3-10 nodes
- **Spot Pool** (Standard_D4s_v5): batch jobs, CI runners, data processing. Taint: `spot=true:NoSchedule`. ~90% cheaper
- **GPU Pool** (Standard_NC6s_v3): ML training jobs, inference pods. Taint: `gpu=true:NoSchedule`. Scale to zero when idle

Taint + Toleration + nodeSelector Flow:

A pod with toleration `gpu=true` and nodeSelector `pool=gpu` passes through the scheduler:

- App Pool ✗ (no gpu label)
- Spot Pool ✗ (no gpu label)
- GPU Pool ✓ (toleration + label match)
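The flow above, as a pod spec sketch (the image name is illustrative; the `sku=gpu` taint and `gpupool` name match the GPU pool created in the Hands-on section below):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  nodeSelector:
    agentpool: gpupool              # label AKS derives from the pool name
  tolerations:
  - key: "sku"                      # matches the GPU pool's taint
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: myacr.azurecr.io/trainer:v1   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1           # requires the NVIDIA device plugin on the node
```

A pending pod like this is also what triggers the autoscaler to scale the GPU pool up from zero.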

⌨️ Hands-on

List Existing Node Pools

```bash
# List all node pools in your cluster
az aks nodepool list --resource-group rg-dev --cluster-name dev-cluster -o table

# Example output:
# Name        OsType    VmSize           Count  Mode    OrchestratorVersion
# ----------  --------  ---------------  -----  ------  -------------------
# agentpool   Linux     Standard_D2s_v5  2      System  1.29.2
```

Add a User Node Pool

```bash
# Add a user pool for application workloads
az aks nodepool add \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --mode User \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --max-pods 110 \
  --zones 1 2 3 \
  --labels workload=app environment=dev \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 8 \
  --os-sku AzureLinux

# Verify the pool was added
az aks nodepool list -g rg-dev --cluster-name dev-cluster -o table
kubectl get nodes -l agentpool=apppool
```

Add a Spot Node Pool

```bash
# Add a spot pool for batch/CI workloads (up to 90% cheaper)
az aks nodepool add \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name spotpool \
  --mode User \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-count 2 \
  --node-vm-size Standard_D4s_v5 \
  --max-pods 110 \
  --labels workload=batch priority=spot \
  --node-taints "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10

# AKS automatically adds the spot taint, but adding it explicitly ensures clarity
# To schedule pods on spot nodes, add this toleration to your pod spec:
# tolerations:
# - key: "kubernetes.azure.com/scalesetpriority"
#   operator: "Equal"
#   value: "spot"
#   effect: "NoSchedule"
```

Add a GPU Node Pool

```bash
# Add a GPU pool for ML workloads (scale to zero when not training)
az aks nodepool add \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name gpupool \
  --mode User \
  --node-count 0 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints "sku=gpu:NoSchedule" \
  --labels workload=ml accelerator=nvidia \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 3

# When a pod with a matching toleration + GPU resource request appears,
# the autoscaler spins up a GPU node. When the job finishes, the pool scales back to 0.
```

Scale a Node Pool

```bash
# Manual scale — set exact node count
az aks nodepool scale \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --node-count 5

# Update autoscaler limits
az aks nodepool update \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --update-cluster-autoscaler \
  --min-count 3 \
  --max-count 15

# Disable autoscaler (switch to manual)
az aks nodepool update \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --disable-cluster-autoscaler
```

Inspect Nodes and Labels

```bash
# List nodes with their pool, VM size, and zone
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
POOL:.metadata.labels.agentpool,\
VM:.metadata.labels.node\\.kubernetes\\.io/instance-type,\
ZONE:.metadata.labels.topology\\.kubernetes\\.io/zone,\
STATUS:.status.conditions[-1].type

# Example output:
# NAME                               POOL       VM                ZONE       STATUS
# aks-agentpool-12345-vmss000000     agentpool  Standard_D2s_v5   eastus-1   Ready
# aks-agentpool-12345-vmss000001     agentpool  Standard_D2s_v5   eastus-2   Ready
# aks-apppool-67890-vmss000000       apppool    Standard_D4s_v5   eastus-1   Ready
# aks-apppool-67890-vmss000001       apppool    Standard_D4s_v5   eastus-2   Ready
# aks-apppool-67890-vmss000002       apppool    Standard_D4s_v5   eastus-3   Ready

# Check taints on a node
kubectl describe node aks-agentpool-12345-vmss000000 | grep -A 3 "Taints:"
# Taints: CriticalAddonsOnly=true:NoSchedule

# List all labels on a specific node pool's nodes
kubectl get nodes -l agentpool=spotpool --show-labels
```

Deploy a Pod to a Specific Node Pool

```yaml
# deploy-to-apppool.yaml — target the app pool using nodeSelector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      nodeSelector:
        agentpool: apppool          # targets the apppool node pool
      containers:
      - name: web-api
        image: myacr.azurecr.io/web-api:v1.2
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi
```

```yaml
# batch-job-on-spot.yaml — schedule job on spot nodes with toleration
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      nodeSelector:
        agentpool: spotpool           # target spot pool
      tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      containers:
      - name: processor
        image: myacr.azurecr.io/data-processor:v2.0
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
      restartPolicy: OnFailure
```

Upgrade a Node Pool

```bash
# Upgrade a specific node pool to a new K8s version
az aks nodepool upgrade \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --kubernetes-version 1.30.0

# Upgrade node image only (no K8s version change — just OS patches)
az aks nodepool upgrade \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --node-image-only

# Check current node image version
az aks nodepool show -g rg-dev --cluster-name dev-cluster -n apppool \
  --query nodeImageVersion -o tsv
```

🐛 Debugging Scenarios

Scenario 1: "Pods stuck in Pending — no matching node pool"

```bash
# Step 1: Check the pod events
kubectl describe pod <pod-name> | grep -A 10 "Events:"
# Look for: "0/5 nodes are available: 3 node(s) had untolerated taint..."

# Step 2: Check what nodeSelector/tolerations the pod requires
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeSelector}'
kubectl get pod <pod-name> -o jsonpath='{.spec.tolerations}'

# Step 3: Check what taints exist on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Example output:
# NAME                            TAINTS
# aks-agentpool-...-vmss000000    [map[effect:NoSchedule key:CriticalAddonsOnly value:true]]
# aks-apppool-...-vmss000000      <none>

# Step 4: Common causes:
# - Pod has nodeSelector for a pool that scaled to zero → wait for autoscaler
# - Pod needs GPU but targets wrong pool → fix nodeSelector
# - All user pools are tainted and pod has no toleration → add toleration
# - Pod requests more CPU/memory than any node can provide → use larger VM size

# Step 5: If the autoscaler should scale up but doesn't, check its status.
# On AKS the autoscaler runs on the managed control plane (no pods to inspect),
# but it publishes a status ConfigMap:
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
```

Scenario 2: "Spot node was evicted — pods rescheduled but some data was lost"

```bash
# Step 1: Confirm eviction happened
kubectl get events --sort-by='.lastTimestamp' | grep -i "evict\|preempt\|spot"

# Step 2: Check which nodes were affected
kubectl get nodes -l kubernetes.azure.com/scalesetpriority=spot -o wide
# Evicted nodes will be gone; replacements may already be provisioning

# Step 3: Check pod status — pods should reschedule on surviving nodes
kubectl get pods -o wide | grep -v Running

# Step 4: Data loss root cause — spot pods used emptyDir or local volumes
# Fix: Use PersistentVolumeClaims with Azure Disks or Azure Files
# These survive node evictions because the data is on Azure storage, not the VM

# Step 5: Add a PodDisruptionBudget to ensure minimum availability
# apiVersion: policy/v1
# kind: PodDisruptionBudget
# metadata:
#   name: processor-pdb
# spec:
#   minAvailable: 1
#   selector:
#     matchLabels:
#       app: data-processor

# Step 6: Ensure your workload handles SIGTERM gracefully
# Spot evictions send SIGTERM → 30 second grace period → SIGKILL
# Your app should checkpoint or save state within those 30 seconds
```
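Steps 4-6 combined into a sketch (names are illustrative; managed-csi is assumed as the Azure Disk CSI storage class):

```yaml
# Spot-safe Job: checkpoints land on an Azure Disk PVC that survives eviction,
# and the grace period leaves time to flush state before SIGKILL.
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processor
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 25   # finish checkpointing inside the 30s eviction window
      tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      containers:
      - name: processor
        image: myacr.azurecr.io/data-processor:v2.0   # illustrative image
        volumeMounts:
        - name: checkpoints
          mountPath: /var/checkpoints
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: processor-checkpoints   # data lives on Azure storage, not the VM
      restartPolicy: OnFailure
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: processor-checkpoints
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: managed-csi    # assumed default Azure Disk storage class
  resources:
    requests:
      storage: 10Gi
```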

Scenario 3: "Node pool add fails with 'insufficient quota'"

```bash
# Step 1: Check your current quota usage
az vm list-usage --location eastus -o table | grep -i "total\|standard D\|standard NC"

# Example output:
# CurrentValue  Limit    Name
# 12            20       Total Regional vCPUs
# 8             20       Standard DSv5 Family vCPUs
# 0             0        Standard NCSv3 Family vCPUs  ← GPU quota is 0!

# Step 2: Request a quota increase
# Azure Portal → Subscriptions → Usage + Quotas → Request Increase
# Or via CLI:
az quota create --resource-name "StandardNCSv3Family" \
  --scope "/subscriptions/{sub-id}/providers/Microsoft.Compute/locations/eastus" \
  --limit-object value=12

# Step 3: For GPU VMs, quota increases may take 1-2 business days
# Workaround: try a different region with available capacity

# Step 4: Verify quota was increased before retrying
az vm list-usage --location eastus -o table | grep "NC"
```

Scenario 4: "Pods scheduled on system pool despite having a user pool"

```bash
# Step 1: Check if the system pool has the CriticalAddonsOnly taint
kubectl describe node aks-agentpool-12345-vmss000000 | grep -A 3 "Taints:"
# If "Taints: <none>" — the system pool isn't tainted

# Step 2: The default node pool created with az aks create is mode=System
# but doesn't have the taint unless you have a separate user pool
# When only one pool exists, all pods schedule there (including yours)

# Step 3: To enforce separation, make sure you have both pools:
az aks nodepool list -g rg-dev --cluster-name dev-cluster -o table
# If only one pool, add a user pool (see Hands-on section above)

# Step 4: Add the taint to the system pool manually if needed
az aks nodepool update \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name agentpool \
  --node-taints "CriticalAddonsOnly=true:NoSchedule"

# Step 5: Verify pods migrate to the user pool
kubectl get pods -o wide
# Pods without the toleration will be evicted from system nodes
# and rescheduled on user pool nodes
```

🎯 Interview Questions

Beginner

Q: What is a node pool in AKS?

A node pool is a group of nodes (Azure VMs) with identical configuration: same VM size, OS, and Kubernetes version. Each node pool maps to a Virtual Machine Scale Set in the node resource group (the MC_* resource group). AKS clusters can have multiple node pools with different configurations, allowing you to run different workload types on optimized hardware.

Q: What is the difference between system and user node pools?

System pools run Kubernetes system components (CoreDNS, metrics-server, kube-proxy, CSI drivers). At least one system pool must exist. They can't scale to zero. They have the CriticalAddonsOnly=true:NoSchedule taint to prevent application pods from scheduling there. User pools run your application workloads. They're optional, can scale to zero, and have no default taints. Best practice: separate system and user pools to prevent application resource contention from affecting system stability.

Q: What are spot node pools and when should you use them?

Spot node pools use Azure's spare compute capacity at up to 90% discount. Tradeoff: Azure can evict these nodes with 30 seconds notice when it needs the capacity. Use for: batch processing, CI/CD runners, dev/test environments, data pipelines, ML training with checkpointing, and any workload that can tolerate interruption. Never use for: production APIs, databases, or any workload that can't handle sudden termination.

Q: Can a user node pool scale to zero?

Yes. User node pools can be configured with --min-count 0 when cluster autoscaler is enabled. This is especially useful for GPU or spot pools that aren't always needed. When a pod requests resources that match the pool (via nodeSelector or toleration), the autoscaler spins up a node. When the workload completes and no pods need the pool, it scales back to zero. System pools can never scale to zero.

Q: How do you control which pods go to which node pool?

Three mechanisms: 1) nodeSelector — simple key-value match: nodeSelector: { agentpool: apppool }. 2) Taints + Tolerations — nodes repel pods unless they have a matching toleration. Used for system pool separation and spot/GPU pool isolation. 3) Node Affinity — more expressive rules (preferred vs required, multiple conditions). In practice, most teams use nodeSelector + taints for pool targeting.

Intermediate

Q: Why can't you change max-pods on an existing node pool?

The max-pods setting determines the IP allocation and routing configuration for each node at creation time. With Azure CNI, each pod pre-allocates a VNet IP — changing max-pods would require re-IPing all pods and reconfiguring VMSS networking, which isn't safe to do in-place. To change max-pods, create a new node pool with the desired setting, cordon + drain the old pool, and delete it. This is a design constraint of how Azure CNI allocates IPs.

Q: How does the cluster autoscaler decide when to scale a node pool?

Scale up: When pods are Pending because no node has enough resources (CPU/memory) to schedule them. The autoscaler simulates adding a node and checks if the pending pods would fit. Scale down: When a node's utilization (requested resources / allocatable) drops below ~50% (default) for 10+ minutes, and all pods on that node can be moved elsewhere. Nodes with local storage, pods without controllers, or pods with restrictive PDBs won't be scaled down. Scale-down is conservative to avoid thrashing.

Q: What happens to pods when a spot node gets evicted?

Azure sends a 30-second eviction notice. The node is drained: kubelet sends SIGTERM to all pod containers, waits for the grace period (default 30s), then SIGKILL. If eviction-policy=Delete, the VM is deleted entirely. The pods' controller (Deployment, Job, etc.) detects the pod termination and creates replacement pods — the scheduler places them on available non-spot nodes or surviving spot nodes. In-memory data and emptyDir volumes are lost. PersistentVolumes backed by Azure Disks survive and re-attach.

Q: When would you use Windows node pools in AKS?

Windows node pools are for running Windows containers — typically legacy .NET Framework applications that can't run on Linux. Key constraints: Windows pools can only be user pools (system pool must be Linux), they cost more (Windows Server license included in VM pricing), have fewer AKS features (no Azure Linux, limited network policies), and have slower node startup. If your .NET app targets .NET 6+ (or later), containerize it on Linux instead — it's cheaper, faster, and has better AKS support.

Q: How do you handle node pool upgrades in production without downtime?

AKS performs rolling upgrades by default: 1) A new node with the new version is added (surge node). 2) An old node is cordoned (no new pods). 3) Pods are drained (evicted) from the old node. 4) Old node is deleted. This repeats for each node. Configure max surge: --max-surge 1 (one node at a time, conservative) or --max-surge 33% (faster but uses more temp resources). Ensure PDBs allow at least one pod to be evicted. The process is automatic — you just trigger the upgrade command.

Scenario-Based

Q: Your company runs a web app (3 replicas), a batch processing pipeline (variable load), and occasional ML training jobs. Design the node pool strategy.

Four pools: 1) System pool: 2× Standard_D2s_v5, mode=System, always-on for K8s system pods. 2) App pool: Standard_D4s_v5, autoscaler min=2 max=6, mode=User — runs the web app with nodeSelector. 3) Spot pool: Standard_D4s_v5, priority=Spot, autoscaler min=0 max=10 — runs batch jobs with tolerations. 4) GPU pool: Standard_NC6s_v3, autoscaler min=0 max=2, taint sku=gpu:NoSchedule — ML training only, scales to zero when no training jobs. Net effect: the web app runs on reliable VMs, batch runs on cheap spot capacity, and GPUs cost nothing when idle.

Q: Pods are stuck Pending. kubectl describe shows "0/5 nodes are available: 2 node(s) had untolerated taint {CriticalAddonsOnly: true}, 3 node(s) didn't match Pod's node affinity/selector." Diagnose and fix.

The message tells you: 2 system nodes are blocked by the CriticalAddonsOnly taint (correct behavior — app pods shouldn't go there). 3 other nodes exist but don't match the pod's nodeSelector or affinity rule. Fix: Check the pod's nodeSelector: kubectl get pod <pod> -o jsonpath='{.spec.nodeSelector}'. It probably targets a pool that either doesn't exist yet (scale-to-zero), has wrong labels, or was deleted. Either: a) Fix the deployment's nodeSelector to target an existing pool, b) Create the expected pool with matching labels, or c) If using autoscaler at min=0, wait for scale-up (check autoscaler logs for errors).

Q: Your spot node pool keeps getting evicted during peak hours (2-4 PM), disrupting batch jobs. How do you improve reliability?

1) Schedule batch jobs outside peak hours (nights/weekends) using CronJobs. 2) Spread across SKUs: create multiple spot pools with different VM sizes, since eviction rates vary per SKU and region. 3) Implement job checkpointing so interrupted jobs resume from the last checkpoint. 4) Use a mixed strategy: critical batch runs on regular user pool nodes, non-critical runs on spot. 5) Set --spot-max-price slightly above the average spot price to reduce price-based evictions (still far cheaper than on-demand, though capacity evictions can still occur). 6) Try a region with more spare capacity during those hours.

Q: You have a single node pool with 30 max-pods. Your team wants to deploy 50 microservices with 2 replicas each (100 pods). The 3 nodes can only run 90 pods total. What do you do?

Since max-pods cannot be changed on an existing pool: 1) Create a new node pool with --max-pods 110: az aks nodepool add --name newpool --max-pods 110. 2) Cordon the old pool: kubectl cordon <old-nodes>. 3) Drain pods from old nodes: kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data. 4) Delete the old pool: az aks nodepool delete --name agentpool. Now 3 nodes × 110 max-pods = 330 pod capacity — more than enough. Lesson learned: Always set max-pods=110 at cluster creation. The default 30 for Azure CNI is too low for most production clusters.

Q: Your production cluster costs $5,000/month. Management wants to cut it by 40%. What node pool optimizations do you recommend?

1) Spot pools for non-critical workloads: Move batch processing, CI runners, and background workers to spot (saves up to 90% on those VMs). 2) Right-size VMs: Check actual CPU/memory usage with kubectl top nodes and Container Insights. If nodes average 30% utilization, downsize VMs or reduce count. 3) Autoscaler: Enable cluster autoscaler on all user pools — scale down during off-peak automatically. 4) Scale-to-zero: GPU and specialty pools should scale to 0 when idle. 5) Reserved instances: For the baseline node count that always runs, purchase 1-year Azure Reserved Instances (save 30-40%). 6) Stop dev/staging clusters after hours. Combined, these typically achieve 40-60% savings.

🌍 Real-World Use Case

A media streaming company optimized their AKS cluster with a multi-pool strategy along the lines of this lesson: a small always-on system pool, an autoscaled user pool for the streaming APIs, a spot pool for video transcoding batch jobs, and a GPU pool that scales to zero between recommendation-model training runs.

📝 Summary

Node pools let you match hardware to workloads: a small system pool for Kubernetes components (protected by the CriticalAddonsOnly taint), autoscaled user pools for applications, spot pools for interruptible batch work at up to 90% savings, and scale-to-zero GPU pools for ML. Target pools with labels + nodeSelector, isolate them with taints + tolerations, and remember that max-pods is fixed at pool creation.
