Node Pools
System vs user node pools, VM size selection, spot instances for massive savings, taints, labels, and multi-pool strategies for production workloads.
🧒 Simple Explanation (ELI5)
Think of node pools like different teams in a company:
- The System pool is the IT department — they keep the lights on, manage the network, run the mail server. You need them running at all times. (These run CoreDNS, kube-proxy, metrics-server.)
- The User pool is a product team — they work on the actual product. You can add more people when there's a big project, and scale down during quiet periods.
- You might have a GPU pool for AI/ML engineers — expensive specialists who only come in when there's model training work.
- A Spot pool is like hiring temp workers at a huge discount — they're cheap but can be sent home with no notice if the company needs the desks for someone else.
You wouldn't put the IT team and the factory workers in the same room. Similarly, you separate Kubernetes system components from your application workloads into different node pools.
🔧 Technical Explanation
System vs User Node Pools
Every AKS cluster requires at least one system node pool. System node pools run critical Kubernetes components.
| Aspect | System Node Pool | User Node Pool |
|---|---|---|
| Purpose | Runs Kubernetes system pods (CoreDNS, metrics-server, kube-proxy, CSI drivers) | Runs your application workloads |
| Required? | Yes — at least one system pool must exist | Optional (but recommended for production) |
| Default taint | CriticalAddonsOnly=true:NoSchedule (recommended — AKS doesn't apply it automatically; add it with --node-taints) | None |
| Minimum nodes | 1 (dev) or 3 (production with zones) | 0 (can scale to zero) |
| Scale to zero? | No — cluster breaks without system pods | Yes — great for cost savings |
| VM size recommendation | Standard_D2s_v5 (2 vCPU, 8 GB) | Depends on workload |
VM Sizes for Different Workloads
| Workload Type | Recommended VM Series | Example SKU | Specs | ~Monthly Cost |
|---|---|---|---|---|
| System pool | Dsv5 (general purpose) | Standard_D2s_v5 | 2 vCPU, 8 GB RAM | ~$70 |
| Web APIs / microservices | Dsv5 (general purpose) | Standard_D4s_v5 | 4 vCPU, 16 GB RAM | ~$140 |
| Memory-intensive (caching, search) | Esv5 (memory optimized) | Standard_E4s_v5 | 4 vCPU, 32 GB RAM | ~$185 |
| CPU-intensive (batch, encoding) | Fsv2 (compute optimized) | Standard_F8s_v2 | 8 vCPU, 16 GB RAM | ~$245 |
| AI/ML training | NC-series (GPU) | Standard_NC6s_v3 | 6 vCPU, 112 GB RAM, 1× V100 | ~$2,200 |
| Dev/test | Bs-series (burstable) | Standard_B2s | 2 vCPU, 4 GB RAM | ~$30 |
| Windows workloads | Dsv5 (general purpose) | Standard_D4s_v5 | 4 vCPU, 16 GB RAM | ~$180 (Windows license) |
Spot Node Pools
Spot instances use Azure's spare compute capacity at up to 90% discount. The catch: Azure can evict your nodes with 30 seconds notice when it needs the capacity back.
| Setting | Description |
|---|---|
| --priority Spot | Creates a spot node pool (discounted VMs) |
| --eviction-policy Delete | Evicted VMs are deleted (recommended). Alternative: Deallocate (preserves the OS disk) |
| --spot-max-price -1 | Pay the current market price (recommended). Or set a cap, e.g. --spot-max-price 0.05 |
Only run workloads on spot nodes that can tolerate interruption: batch jobs, CI/CD runners, stateless workers, dev/test, data processing. Never run your production API or database on spot nodes. AKS applies a kubernetes.azure.com/scalesetpriority:spot taint — your pods need a matching toleration.
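The savings math is straightforward. A back-of-the-envelope sketch — the price and discount figures below are illustrative assumptions, not quotes (actual spot prices float with regional demand):

```bash
# Rough spot-vs-on-demand comparison (illustrative numbers only)
on_demand=140   # ~$/month for Standard_D4s_v5, from the table above
discount=70     # spot discounts commonly land in the 60-90% range
spot=$(( on_demand * (100 - discount) / 100 ))
echo "spot: ~\$${spot}/month vs ~\$${on_demand}/month on-demand"
```

At a 70% discount the D4s_v5 drops from ~$140 to ~$42 a month; at 90% it would be ~$14 — which is why batch fleets on spot are so much cheaper.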
Node Pool Scaling Options
| Method | How | Use Case |
|---|---|---|
| Manual scaling | az aks nodepool scale --node-count 5 | Known capacity needs, planned events |
| Cluster autoscaler | --enable-cluster-autoscaler --min-count 2 --max-count 10 | Variable traffic, auto-adjust to demand |
| Scale to zero | --min-count 0 (user pools only) | GPU/spot pools that aren't always needed |
Taints and Labels
Taints and labels on node pools control which pods schedule where:
- Labels — Key-value metadata on nodes. Pods use nodeSelector or nodeAffinity to target specific node pools.
- Taints — Repel pods unless they have a matching toleration. Used to reserve node pools for specific workloads.
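Putting both mechanisms together: a pod that should land on a reserved pool needs a nodeSelector for the pool's label and a toleration for its taint. A sketch, assuming a GPU pool labeled agentpool=gpupool and tainted sku=gpu:NoSchedule (the names used in the Hands-on section):

```yaml
# Pod spec fragment (illustrative pool/taint names)
spec:
  nodeSelector:
    agentpool: gpupool      # label match — steer the pod to the GPU pool
  tolerations:
    - key: "sku"            # taint match — allow the pod onto the tainted pool
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
```

Without the toleration the pod can never schedule there; without the nodeSelector it merely may schedule elsewhere.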
OS SKU Options
| OS SKU | Description | When to Use |
|---|---|---|
| Ubuntu | Default Linux OS for AKS nodes. Battle-tested, broad compatibility. | Default choice for most workloads |
| AzureLinux | Microsoft's Linux distro (formerly CBL-Mariner). Smaller image, faster boot, more secure. | Performance-sensitive or security-hardened clusters |
| Windows2022 | Windows Server node pool. Runs Windows containers. | .NET Framework apps, Windows-only workloads |
Max Pods Per Node
The --max-pods setting determines how many pods a single node can run:
- kubenet default: 110
- Azure CNI default: 30 (because each pod consumes a VNet IP)
- Recommended for production: 110 with Azure CNI Overlay, or 50-110 with Azure CNI (plan subnet size accordingly)
- This value is set at node pool creation and cannot be changed later — you must create a new node pool to change it
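Because the value is fixed at creation, subnet sizing has to be planned up front with traditional Azure CNI. A quick sizing sketch (node and pod counts are illustrative):

```bash
# Estimate VNet IPs needed by a pool under traditional (non-overlay) Azure CNI:
# each node consumes 1 IP itself plus max-pods pre-allocated pod IPs.
nodes=8        # planned maximum node count (autoscaler --max-count)
max_pods=110   # the pool's --max-pods setting
ips_needed=$(( nodes * (max_pods + 1) ))
echo "IPs needed: $ips_needed"

# Smallest common subnet that fits (Azure reserves 5 IPs per subnet)
for prefix in 24 23 22 21 20; do
  usable=$(( (1 << (32 - prefix)) - 5 ))
  if [ "$usable" -ge "$ips_needed" ]; then
    echo "plan at least a /$prefix subnet ($usable usable IPs)"
    break
  fi
done
```

Leave headroom beyond this estimate: upgrades add surge nodes, and a full subnet blocks scale-out.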
📊 Multi-Pool Architecture
(Diagram: system, app, spot, and GPU pools side by side — pods bound for the GPU pool carry toleration: gpu=true and nodeSelector: pool=gpu.)
⌨️ Hands-on
List Existing Node Pools
# List all node pools in your cluster
az aks nodepool list --resource-group rg-dev --cluster-name dev-cluster -o table

# Example output:
# Name       OsType    VmSize           Count    Mode      OrchestratorVersion
# ---------- --------  ---------------  -------  --------  -------------------
# agentpool  Linux     Standard_D2s_v5  2        System    1.29.2
Add a User Node Pool
# Add a user pool for application workloads
az aks nodepool add \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --mode User \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --max-pods 110 \
  --zones 1 2 3 \
  --labels workload=app environment=dev \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 8 \
  --os-sku AzureLinux

# Verify the pool was added
az aks nodepool list -g rg-dev --cluster-name dev-cluster -o table
kubectl get nodes -l agentpool=apppool
Add a Spot Node Pool
# Add a spot pool for batch/CI workloads (up to 90% cheaper)
az aks nodepool add \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name spotpool \
  --mode User \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --node-count 2 \
  --node-vm-size Standard_D4s_v5 \
  --max-pods 110 \
  --labels workload=batch priority=spot \
  --node-taints "kubernetes.azure.com/scalesetpriority=spot:NoSchedule" \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10

# AKS automatically adds the spot taint, but adding it explicitly ensures clarity
# To schedule pods on spot nodes, add this toleration to your pod spec:
# tolerations:
# - key: "kubernetes.azure.com/scalesetpriority"
#   operator: "Equal"
#   value: "spot"
#   effect: "NoSchedule"
Add a GPU Node Pool
# Add a GPU pool for ML workloads (scale to zero when not training)
az aks nodepool add \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name gpupool \
  --mode User \
  --node-count 0 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints "sku=gpu:NoSchedule" \
  --labels workload=ml accelerator=nvidia \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 3

# When a pod with a matching toleration + GPU resource request appears,
# the autoscaler spins up a GPU node. When the job finishes, it scales back to 0.
Scale a Node Pool
# Manual scale — set exact node count
az aks nodepool scale \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --node-count 5

# Update autoscaler limits
az aks nodepool update \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --update-cluster-autoscaler \
  --min-count 3 \
  --max-count 15

# Disable autoscaler (switch to manual)
az aks nodepool update \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --disable-cluster-autoscaler
Inspect Nodes and Labels
# List nodes with their pool, VM size, and zone
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
POOL:.metadata.labels.agentpool,\
VM:.metadata.labels.node\\.kubernetes\\.io/instance-type,\
ZONE:.metadata.labels.topology\\.kubernetes\\.io/zone,\
STATUS:.status.conditions[-1].type

# Example output:
# NAME                            POOL       VM               ZONE      STATUS
# aks-agentpool-12345-vmss000000  agentpool  Standard_D2s_v5  eastus-1  Ready
# aks-agentpool-12345-vmss000001  agentpool  Standard_D2s_v5  eastus-2  Ready
# aks-apppool-67890-vmss000000    apppool    Standard_D4s_v5  eastus-1  Ready
# aks-apppool-67890-vmss000001    apppool    Standard_D4s_v5  eastus-2  Ready
# aks-apppool-67890-vmss000002    apppool    Standard_D4s_v5  eastus-3  Ready

# Check taints on a node
kubectl describe node aks-agentpool-12345-vmss000000 | grep -A 3 "Taints:"
# Taints: CriticalAddonsOnly=true:NoSchedule

# List all labels on a specific node pool's nodes
kubectl get nodes -l agentpool=spotpool --show-labels
Deploy a Pod to a Specific Node Pool
# deploy-to-apppool.yaml — target the app pool using nodeSelector
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-api
spec:
replicas: 3
selector:
matchLabels:
app: web-api
template:
metadata:
labels:
app: web-api
spec:
nodeSelector:
agentpool: apppool # targets the apppool node pool
containers:
- name: web-api
image: myacr.azurecr.io/web-api:v1.2
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "1"
memory: 1Gi
# batch-job-on-spot.yaml — schedule job on spot nodes with toleration
apiVersion: batch/v1
kind: Job
metadata:
name: data-processing
spec:
template:
spec:
nodeSelector:
agentpool: spotpool # target spot pool
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
operator: "Equal"
value: "spot"
effect: "NoSchedule"
containers:
- name: processor
image: myacr.azurecr.io/data-processor:v2.0
resources:
requests:
cpu: "2"
memory: 4Gi
restartPolicy: OnFailure
Upgrade a Node Pool
# Upgrade a specific node pool to a new K8s version
az aks nodepool upgrade \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --kubernetes-version 1.30.0

# Upgrade node image only (no K8s version change — just OS patches)
az aks nodepool upgrade \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name apppool \
  --node-image-only

# Check current node image version
az aks nodepool show -g rg-dev --cluster-name dev-cluster -n apppool \
  --query nodeImageVersion -o tsv
🐛 Debugging Scenarios
Scenario 1: "Pods stuck in Pending — no matching node pool"
# Step 1: Check the pod events
kubectl describe pod <pod-name> | grep -A 10 "Events:"
# Look for: "0/5 nodes are available: 3 node(s) had untolerated taint..."
# Step 2: Check what nodeSelector/tolerations the pod requires
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeSelector}'
kubectl get pod <pod-name> -o jsonpath='{.spec.tolerations}'
# Step 3: Check what taints exist on nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Example output:
# NAME TAINTS
# aks-agentpool-...-vmss000000 [map[effect:NoSchedule key:CriticalAddonsOnly value:true]]
# aks-apppool-...-vmss000000 <none>
# Step 4: Common causes:
# - Pod has nodeSelector for a pool that scaled to zero → wait for autoscaler
# - Pod needs GPU but targets wrong pool → fix nodeSelector
# - All user pools are tainted and pod has no toleration → add toleration
# - Pod requests more CPU/memory than any node can provide → use larger VM size
# Step 5: If the autoscaler should scale up but doesn't, check its status.
# On AKS the autoscaler runs on the managed control plane (there's no pod
# to get logs from), but it publishes a status ConfigMap:
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
Scenario 2: "Spot node was evicted — pods rescheduled but some data was lost"
# Step 1: Confirm eviction happened
kubectl get events --sort-by='.lastTimestamp' | grep -i "evict\|preempt\|spot"

# Step 2: Check which nodes were affected
kubectl get nodes -l kubernetes.azure.com/scalesetpriority=spot -o wide
# Evicted nodes will be gone; replacements may already be provisioning

# Step 3: Check pod status — pods should reschedule on surviving nodes
kubectl get pods -o wide | grep -v Running

# Step 4: Data loss root cause — spot pods used emptyDir or local volumes
# Fix: Use PersistentVolumeClaims with Azure Disks or Azure Files
# These survive node evictions because the data is on Azure storage, not the VM

# Step 5: Add a PodDisruptionBudget to ensure minimum availability
# apiVersion: policy/v1
# kind: PodDisruptionBudget
# metadata:
#   name: processor-pdb
# spec:
#   minAvailable: 1
#   selector:
#     matchLabels:
#       app: data-processor

# Step 6: Ensure your workload handles SIGTERM gracefully
# Spot evictions send SIGTERM → 30 second grace period → SIGKILL
# Your app should checkpoint or save state within those 30 seconds
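For Step 6, a minimal sketch of the graceful-termination pattern in shell — the worker below checkpoints when SIGTERM arrives, the same signal a spot eviction (or a kubelet drain) delivers first. The checkpoint path and timings are purely illustrative:

```bash
# Hypothetical worker that saves progress on SIGTERM, then exits cleanly
CHECKPOINT="${TMPDIR:-/tmp}/spot-checkpoint"
rm -f "$CHECKPOINT"

worker() {
  local processed=0
  # On SIGTERM: persist state within the grace period, then exit 0
  trap 'echo "$processed" > "$CHECKPOINT"; exit 0' TERM
  while :; do
    processed=$((processed + 1))   # stand-in for real batch work
    sleep 0.1
  done
}

worker &            # run in the background so we can signal it
pid=$!
sleep 0.5
kill -TERM "$pid"   # simulate the eviction notice
wait "$pid"
echo "checkpoint: $(cat "$CHECKPOINT") items processed"
```

On restart, the replacement pod reads the checkpoint and resumes instead of starting over.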
Scenario 3: "Node pool add fails with 'insufficient quota'"
# Step 1: Check your current quota usage
az vm list-usage --location eastus -o table | grep -i "total\|standard D\|standard NC"
# Example output:
# CurrentValue Limit Name
# 12 20 Total Regional vCPUs
# 8 20 Standard DSv5 Family vCPUs
# 0 0 Standard NCSv3 Family vCPUs ← GPU quota is 0!
# Step 2: Request a quota increase
# Azure Portal → Subscriptions → Usage + Quotas → Request Increase
# Or via CLI (requires the Azure CLI 'quota' extension):
az quota create --resource-name "StandardNCSv3Family" \
--scope "/subscriptions/{sub-id}/providers/Microsoft.Compute/locations/eastus" \
--limit-object value=12
# Step 3: For GPU VMs, quota increases may take 1-2 business days
# Workaround: try a different region with available capacity
# Step 4: Verify quota was increased before retrying
az vm list-usage --location eastus -o table | grep "NC"
Scenario 4: "Pods scheduled on system pool despite having a user pool"
# Step 1: Check if the system pool has the CriticalAddonsOnly taint
kubectl describe node aks-agentpool-12345-vmss000000 | grep -A 3 "Taints:"
# If it shows "Taints: <none>" — the system pool isn't tainted

# Step 2: The default pool created by az aks create is mode=System, but AKS
# does not add the CriticalAddonsOnly taint automatically — without it,
# all pods (including yours) can schedule on system nodes

# Step 3: To enforce separation, make sure you have both pools:
az aks nodepool list -g rg-dev --cluster-name dev-cluster -o table
# If only one pool exists, add a user pool (see Hands-on section above)

# Step 4: Add the taint to the system pool manually
az aks nodepool update \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name agentpool \
  --node-taints "CriticalAddonsOnly=true:NoSchedule"

# Step 5: Verify new pods land on the user pool
kubectl get pods -o wide
# Note: NoSchedule only blocks new scheduling — pods already running on
# system nodes stay put. Restart them (kubectl rollout restart) so they
# reschedule onto the user pool.
🎯 Interview Questions
Beginner
Q: What is a node pool in AKS?
A node pool is a group of nodes (Azure VMs) with identical configuration — same VM size, OS, and Kubernetes version. Each node pool maps to a VM Scale Set in the node resource group (MC_*). AKS clusters can have multiple node pools with different configurations, allowing you to run different workload types on optimized hardware.
Q: What's the difference between system and user node pools?
System pools run Kubernetes system components (CoreDNS, metrics-server, kube-proxy, CSI drivers). At least one system pool must exist, and they can't scale to zero. Best practice is to give them the CriticalAddonsOnly=true:NoSchedule taint so application pods can't schedule there (AKS system pods tolerate it). User pools run your application workloads. They're optional, can scale to zero, and have no default taints. Best practice: separate system and user pools to prevent application resource contention from affecting system stability.
Q: What are spot node pools, and which workloads belong on them?
Spot node pools use Azure's spare compute capacity at up to 90% discount. Tradeoff: Azure can evict these nodes with 30 seconds' notice when it needs the capacity. Use for: batch processing, CI/CD runners, dev/test environments, data pipelines, ML training with checkpointing, and any workload that can tolerate interruption. Never use for: production APIs, databases, or any workload that can't handle sudden termination.
Q: Can an AKS node pool scale to zero?
Yes. User node pools can be configured with --min-count 0 when cluster autoscaler is enabled. This is especially useful for GPU or spot pools that aren't always needed. When a pod requests resources that match the pool (via nodeSelector or toleration), the autoscaler spins up a node. When the workload completes and no pods need the pool, it scales back to zero. System pools can never scale to zero.
Q: What mechanisms control which node pool a pod is scheduled on?
Three mechanisms: 1) nodeSelector — simple key-value match: nodeSelector: { agentpool: apppool }. 2) Taints + Tolerations — nodes repel pods unless they have a matching toleration. Used for system pool separation and spot/GPU pool isolation. 3) Node Affinity — more expressive rules (preferred vs required, multiple conditions). In practice, most teams use nodeSelector + taints for pool targeting.
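Since nodeAffinity is the least familiar of the three, here's a sketch of the affinity form of agentpool: apppool with an added soft zone preference — the label values are illustrative:

```yaml
# Pod spec fragment: hard pool requirement + soft zone preference
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: agentpool
              operator: In
              values: ["apppool"]
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["eastus-1"]
```

The required term behaves like nodeSelector; the preferred term only biases the scheduler when multiple nodes qualify.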
Intermediate
Q: Why can't --max-pods be changed on an existing node pool, and what do you do instead?
The max-pods setting determines the IP allocation and routing configuration for each node at creation time. With Azure CNI, each pod pre-allocates a VNet IP — changing max-pods would require re-IPing all pods and reconfiguring VMSS networking, which isn't safe to do in-place. To change max-pods, create a new node pool with the desired setting, cordon + drain the old pool, and delete it. This is a design constraint of how Azure CNI allocates IPs.
Q: Under what conditions does the cluster autoscaler scale a node pool up or down?
Scale up: When pods are Pending because no node has enough resources (CPU/memory) to schedule them. The autoscaler simulates adding a node and checks if the pending pods would fit. Scale down: When a node's utilization (requested resources / allocatable) drops below ~50% (default) for 10+ minutes, and all pods on that node can be moved elsewhere. Nodes with local storage, pods without controllers, or pods with restrictive PDBs won't be scaled down. Scale-down is conservative to avoid thrashing.
Q: Walk through what happens when Azure evicts a spot node.
Azure sends a 30-second eviction notice. The node is drained: kubelet sends SIGTERM to all pod containers, waits for the grace period (default 30s), then SIGKILL. If eviction-policy=Delete, the VM is deleted entirely. The pods' controller (Deployment, Job, etc.) detects the pod termination and creates replacement pods — the scheduler places them on available non-spot nodes or surviving spot nodes. In-memory data and emptyDir volumes are lost. PersistentVolumes backed by Azure Disks survive and re-attach.
Q: When would you use a Windows node pool, and what constraints come with it?
Windows node pools are for running Windows containers — typically legacy .NET Framework applications that can't run on Linux. Key constraints: Windows pools can only be user pools (the system pool must be Linux), they cost more (Windows Server license included in VM pricing), have fewer AKS features (no Azure Linux, limited network policies), and have slower node startup. If your .NET app targets .NET 6+ (or later), containerize it on Linux instead — it's cheaper, faster, and has better AKS support.
Q: How does AKS upgrade a node pool without downtime?
AKS performs rolling upgrades by default: 1) A new node with the new version is added (surge node). 2) An old node is cordoned (no new pods). 3) Pods are drained (evicted) from the old node. 4) The old node is deleted. This repeats for each node. Configure max surge: --max-surge 1 (one node at a time, conservative) or --max-surge 33% (faster but uses more temporary resources). Ensure PDBs allow at least one pod to be evicted. The process is automatic — you just trigger the upgrade command.
Scenario-Based
Q: Design the node pools for a platform with a customer-facing web app, nightly batch jobs, and occasional ML training.
Four pools: 1) System pool: 2× Standard_D2s_v5, mode=System, always-on for K8s system pods. 2) App pool: Standard_D4s_v5, autoscaler min=2 max=6, mode=User — runs the web app with nodeSelector. 3) Spot pool: Standard_D4s_v5, priority=Spot, autoscaler min=0 max=10 — runs batch jobs with tolerations. 4) GPU pool: Standard_NC6s_v3, autoscaler min=0 max=2, taint sku=gpu:NoSchedule — ML training only, scales to zero when no training jobs. Total cost optimization: web app on reliable VMs, batch on cheap spot, GPU only when needed.
Q: Pods are Pending with "0/5 nodes are available: 2 node(s) had untolerated taint, 3 node(s) didn't match node selector." How do you debug it?
The message tells you: 2 system nodes are blocked by the CriticalAddonsOnly taint (correct behavior — app pods shouldn't go there). 3 other nodes exist but don't match the pod's nodeSelector or affinity rule. Fix: Check the pod's nodeSelector: kubectl get pod <pod> -o jsonpath='{.spec.nodeSelector}'. It probably targets a pool that either doesn't exist yet (scale-to-zero), has wrong labels, or was deleted. Either: a) Fix the deployment's nodeSelector to target an existing pool, b) Create the expected pool with matching labels, or c) If using the autoscaler at min=0, wait for scale-up (check autoscaler status for errors).
Q: Your spot nodes keep getting evicted during business hours, disrupting batch jobs. How do you reduce the impact?
1) Schedule batch jobs outside peak hours (nights/weekends) using CronJobs. 2) Use multiple spot VM sizes: create multiple spot pools with different SKUs — eviction risk is per-SKU, so diversity reduces correlated evictions. 3) Implement job checkpointing so interrupted jobs resume from the last checkpoint. 4) Consider a mixed strategy: critical batch runs on regular user pool nodes, non-critical on spot. 5) Set --spot-max-price slightly above the average market price to reduce price-based evictions (still far cheaper than on-demand). 6) Use a different region with more spare capacity during those hours.
Q: A cluster built with the Azure CNI default of max-pods 30 has run out of pod capacity. How do you fix it?
Since max-pods cannot be changed on an existing pool: 1) Create a new node pool with --max-pods 110: az aks nodepool add --name newpool --max-pods 110. 2) Cordon the old pool's nodes: kubectl cordon <old-node>. 3) Drain pods from the old nodes: kubectl drain <old-node> --ignore-daemonsets --delete-emptydir-data. 4) Delete the old pool: az aks nodepool delete --name <old-pool>. Now 3 nodes × 110 max-pods = 330 pod capacity — more than enough. Lesson learned: set max-pods deliberately at cluster creation. The Azure CNI default of 30 is too low for most production clusters.
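Expressed as CLI steps, the replacement flow from the answer above might look like the sketch below. Pool names (oldpool, newpool) are placeholders, and note that a cluster's only system pool can't be deleted this way:

```bash
# 1) Create the replacement pool with the desired --max-pods
az aks nodepool add \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name newpool \
  --mode User \
  --node-count 3 \
  --node-vm-size Standard_D4s_v5 \
  --max-pods 110

# 2) Cordon and drain every node in the old pool
for node in $(kubectl get nodes -l agentpool=oldpool -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# 3) Delete the old pool once it's empty
az aks nodepool delete \
  --resource-group rg-dev \
  --cluster-name dev-cluster \
  --name oldpool
```

Draining node by node keeps the workload available as pods reschedule onto the new pool.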
Q: You're asked to cut the cluster's compute spend significantly. What levers do you have?
1) Spot pools for non-critical workloads: Move batch processing, CI runners, and background workers to spot (saves up to 90% on those VMs). 2) Right-size VMs: Check actual CPU/memory usage with kubectl top nodes and Container Insights. If nodes average 30% utilization, downsize VMs or reduce count. 3) Autoscaler: Enable cluster autoscaler on all user pools — scale down during off-peak automatically. 4) Scale-to-zero: GPU and specialty pools should scale to 0 when idle. 5) Reserved instances: For the baseline node count that always runs, purchase 1-year Azure Reserved Instances (save 30-40%). 6) Stop dev/staging clusters after hours. Combined, these typically achieve 40-60% savings.
🌍 Real-World Use Case
A media streaming company optimized their AKS cluster with a multi-pool strategy:
- Before: Single node pool of 20× Standard_D8s_v5 ($2,800/month each). All workloads mixed together. Monthly cost: $56,000. Average utilization: 35%.
- After (4 pools):
- System pool: 3× Standard_D2s_v5 for system pods — $210/month
- API pool: 5× Standard_D4s_v5 (autoscaler 3-8) for application pods — $700/month
- Encoding pool: 0-10× Standard_F8s_v2 Spot for video encoding — ~$250/month (90% discount)
- ML pool: 0-2× Standard_NC6s_v3 for recommendation engine training — $0-4,400/month (only when training)
- Result: Monthly cost dropped from $56,000 to ~$12,000 (79% reduction). Encoding throughput actually increased because F-series VMs are compute-optimized for that workload. ML training only runs twice a week, scaling the GPU pool from 0 to 2 for 6 hours then back to 0.
📝 Summary
- System pools run Kubernetes components; user pools run your workloads — always separate them in production
- Choose VM sizes based on workload type: general purpose (D-series), memory-optimized (E-series), compute (F-series), GPU (NC-series)
- Spot pools offer up to 90% savings for interruptible workloads — never use for production APIs or databases
- Use taints + nodeSelector to control pod placement across pools
- Cluster autoscaler scales node pools automatically; user pools can scale to zero
- --max-pods is set at creation and cannot be changed — default to 110 for production
- Multiple specialized pools beat one large generic pool for both cost and performance