If you work with multiple AKS clusters, each get-credentials call adds a new context to your kubeconfig. Always check which cluster you're targeting with kubectl config current-context before running commands — especially destructive ones like delete or drain.
Cluster Creation & Configuration
Create AKS clusters from the CLI, understand every critical flag, connect kubectl, manage upgrades, and optimize costs with start/stop.
🧒 Simple Explanation (ELI5)
Creating an AKS cluster is like ordering a custom computer online:
- You pick the CPU and RAM (VM size for nodes)
- You choose how many units (node count)
- You select the operating system (Ubuntu, AzureLinux)
- You decide on networking options (like choosing WiFi vs Ethernet)
- Azure builds and ships it in about 5-10 minutes
Once it arrives, you plug in your keyboard (kubectl) using the delivery instructions (kubeconfig) and start working.
🔧 Technical Explanation
az aks create — The Master Command
The az aks create command has dozens of flags. Here are the ones that matter for every cluster:
| Flag | What It Does | Recommendation |
|---|---|---|
| `--resource-group` | Resource group to create the cluster in | Use a dedicated RG per cluster or environment |
| `--name` | Cluster name (must be unique in the RG) | Convention: `{env}-{app}-aks` (e.g., `prod-api-aks`) |
| `--node-count` | Number of nodes in the default node pool | Dev: 1-2, Staging: 2-3, Prod: 3+ |
| `--node-vm-size` | Azure VM SKU for worker nodes | Dev: `Standard_B2s`, Prod: `Standard_D4s_v5` |
| `--kubernetes-version` | K8s version to install | Use N-1 (one behind latest) for stability |
| `--network-plugin` | Networking model: `azure` (CNI) or `kubenet` | `azure` for production (pods get VNet IPs) |
| `--enable-managed-identity` | Use managed identity instead of a service principal | Always use this; Microsoft recommends managed identity over service principals |
| `--generate-ssh-keys` | Auto-generate SSH keys for node access | Use for dev; provide your own keys for prod |
| `--enable-addons` | Enable AKS add-ons (monitoring, azure-policy, etc.) | Enable monitoring for all clusters |
| `--zones` | Distribute nodes across availability zones | Always use 1 2 3 for production |
| `--tier` | Cluster SKU tier: `free` or `standard` | `standard` for production (SLA-backed) |
| `--max-pods` | Max pods per node (default: 110 for kubenet, 30 for Azure CNI) | 110 is a safe production default |
Production vs Development Configurations
| Setting | Development | Production |
|---|---|---|
| Node count | 1-2 | 3+ (across availability zones) |
| VM size | Standard_B2s ($30/mo) | Standard_D4s_v5 ($140/mo) |
| K8s version | Latest (experiment) | N-1 (stability) |
| Network plugin | kubenet (simpler) | Azure CNI (VNet integration) |
| SLA tier | Free | Standard (99.95% SLA) |
| Availability zones | Not needed | Zones 1, 2, 3 |
| Monitoring | Optional | Container Insights + Defender |
| API server access | Public | Private cluster or authorized IP ranges |
| Auto-upgrade | patch | node-image (safer) |
kubeconfig Management
After creating a cluster, you need to configure kubectl to connect to it. The az aks get-credentials command merges the cluster's connection info into your local kubeconfig file (~/.kube/config).
Cluster Upgrades
AKS supports in-place Kubernetes version upgrades. The process:
- Control plane upgrades first — Azure updates the managed API server, etcd, scheduler, and controllers.
- Node pools upgrade second — Nodes are cordoned, drained, reimaged with the new K8s version, and uncordoned. This is a rolling update (one node at a time by default).
- You cannot skip minor versions — upgrades must proceed one minor version at a time (e.g., 1.27 → 1.28 is fine; 1.27 → 1.29 requires going through 1.28 first).
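Because minor versions cannot be skipped, a multi-version upgrade is a sequence of hops. The hops can be sketched with a small shell helper (`upgrade_path` is a hypothetical planning aid, not an az feature; it handles minor versions only, not patch suffixes):

```shell
# upgrade_path: print each minor-version hop needed between two K8s versions.
# Hypothetical helper for planning only; AKS itself enforces one minor at a time.
upgrade_path() {
  local from_minor="${1#*.}" to_minor="${2#*.}"   # "1.27" -> "27"
  local major="${1%%.*}" m
  for ((m = from_minor + 1; m <= to_minor; m++)); do
    echo "${major}.${m}"
  done
}

upgrade_path 1.27 1.30   # prints 1.28, 1.29, 1.30: three sequential upgrades
```

Each printed version would be one `az aks upgrade --kubernetes-version <version>` invocation, control plane first as described above.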
Start/Stop for Cost Savings
AKS supports stopping (deallocating) a cluster entirely. This stops all worker nodes and stops billing for compute. The cluster metadata is preserved — when you start it again, everything comes back as it was.
A stopped cluster: loses all running pods, loses ephemeral storage (emptyDir), retains persistent volumes (Azure Disks), retains cluster configuration. Never use this for production. For production cost savings, use cluster autoscaler instead.
⌨️ Hands-on
Create a Development Cluster
# Step 1: Create a resource group
az group create --name rg-aks-dev --location eastus
# Step 2: Create a minimal dev cluster
az aks create \
--resource-group rg-aks-dev \
--name dev-cluster \
--node-count 2 \
--node-vm-size Standard_B2s \
--kubernetes-version 1.29.2 \
--network-plugin kubenet \
--enable-managed-identity \
--generate-ssh-keys \
--enable-addons monitoring \
--tier free \
--tags Environment=dev Team=engineering
# This takes ~5-8 minutes. Output includes the cluster's full JSON configuration.
Create a Production Cluster
# Production cluster with all the bells and whistles
az aks create \
--resource-group rg-aks-prod \
--name prod-cluster \
--node-count 3 \
--node-vm-size Standard_D4s_v5 \
--kubernetes-version 1.28.5 \
--network-plugin azure \
--vnet-subnet-id "/subscriptions/{sub}/resourceGroups/rg-network/providers/Microsoft.Network/virtualNetworks/prod-vnet/subnets/aks-subnet" \
--enable-managed-identity \
--enable-addons monitoring,azure-policy \
--zones 1 2 3 \
--tier standard \
--max-pods 110 \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 10 \
--auto-upgrade-channel node-image \
--tags Environment=production Team=platform CostCenter=CC-1234
# Key differences from dev:
# - Azure CNI with BYO VNet (--network-plugin azure + --vnet-subnet-id)
# - Availability zones (--zones 1 2 3)
# - Standard SLA tier (--tier standard)
# - Cluster autoscaler enabled
# - Auto-upgrade for node images
# - Azure Policy add-on for governance
Connect and Verify
# Download credentials and set kubectl context
az aks get-credentials --resource-group rg-aks-dev --name dev-cluster
# Verify the connection
kubectl cluster-info
# Example output:
# Kubernetes control plane is running at https://dev-cluster-rg-aks-dev-abc123.hcp.eastus.azmk8s.io:443
# CoreDNS is running at https://...
# Check nodes are ready
kubectl get nodes -o wide
# Check your current context (which cluster kubectl is talking to)
kubectl config current-context
# Output: dev-cluster
# List all contexts (if you have multiple clusters)
kubectl config get-contexts
# Switch between clusters
kubectl config use-context prod-cluster
Cluster Upgrades
# Check available upgrades for your cluster
az aks get-upgrades --resource-group rg-aks-dev --name dev-cluster -o table
# Example output:
# Name ResourceGroup MasterVersion Upgrades
# ------- --------------- --------------- --------
# default rg-aks-dev 1.29.2 1.30.0
# Upgrade the cluster (control plane + all node pools)
az aks upgrade \
--resource-group rg-aks-dev \
--name dev-cluster \
--kubernetes-version 1.30.0
# Upgrade only the control plane (node pools stay on old version)
az aks upgrade \
--resource-group rg-aks-dev \
--name dev-cluster \
--kubernetes-version 1.30.0 \
--control-plane-only
# Upgrade a specific node pool
az aks nodepool upgrade \
--resource-group rg-aks-dev \
--cluster-name dev-cluster \
--name agentpool \
--kubernetes-version 1.30.0
# Check upgrade status
az aks show --resource-group rg-aks-dev --name dev-cluster \
--query "{version:kubernetesVersion, provisioningState:provisioningState}"
Start/Stop Cluster (Dev Cost Savings)
# Stop a dev cluster at end of day — stops all billing for node VMs
az aks stop --resource-group rg-aks-dev --name dev-cluster
# Check the cluster state
az aks show --resource-group rg-aks-dev --name dev-cluster --query powerState
# Output: { "code": "Stopped" }
# Start the cluster next morning
az aks start --resource-group rg-aks-dev --name dev-cluster
# Verify nodes are back
kubectl get nodes
# Nodes may take 1-3 minutes to become Ready after start
# Pro tip: Automate with Azure Automation
# Create a runbook that runs "az aks stop" at 7 PM and "az aks start" at 8 AM
# Saves ~60% on dev cluster compute costs (13 hours stopped per weekday)
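The stop/start automation above can be sketched as the script body such a runbook would execute. The `run`/`DRY_RUN` guard is an illustration (not an az feature) so the logic can be exercised without touching a real cluster; the resource group and cluster name reuse the dev values from this page:

```shell
# Sketch of the evening/morning toggle an Azure Automation runbook could run.
# DRY_RUN=1 (the default here) prints the az commands instead of executing them.
RG="rg-aks-dev"; CLUSTER="dev-cluster"; DRY_RUN="${DRY_RUN:-1}"

run() {
  # Execute the given command, or just print it in dry-run mode.
  if [ "$DRY_RUN" = "1" ]; then echo "WOULD RUN: $*"; else "$@"; fi
}

toggle_cluster() {
  case "$1" in
    stop)  run az aks stop  --resource-group "$RG" --name "$CLUSTER" ;;
    start) run az aks start --resource-group "$RG" --name "$CLUSTER" ;;
  esac
}

toggle_cluster stop   # schedule at 19:00; schedule "start" at 08:00
```

In a real runbook you would drop the dry-run guard and let the schedule trigger `toggle_cluster stop` and `toggle_cluster start` directly.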
Useful Inspection Commands
# View the full cluster configuration
az aks show --resource-group rg-aks-dev --name dev-cluster -o json | jq '{
name: .name,
version: .kubernetesVersion,
tier: .sku.tier,
nodeCount: .agentPoolProfiles[0].count,
vmSize: .agentPoolProfiles[0].vmSize,
networkPlugin: .networkProfile.networkPlugin,
autoUpgrade: .autoUpgradeProfile.upgradeChannel,
addons: [.addonProfiles | to_entries[] | select(.value.enabled) | .key]
}'
# Check what add-ons are enabled
az aks addon list --resource-group rg-aks-dev --name dev-cluster -o table
# View cluster auto-upgrade settings
az aks show --resource-group rg-aks-dev --name dev-cluster \
--query autoUpgradeProfile -o json
# Check maintenance window configuration
az aks maintenanceconfiguration list \
--resource-group rg-aks-dev --cluster-name dev-cluster -o table
🐛 Debugging Scenarios
Scenario 1: "Cluster creation failed"
# Common errors and fixes:
# ERROR: "The VM size 'Standard_D4s_v5' is not available in location 'westus'"
# Fix: Check available VM sizes in your region
az vm list-sizes --location westus -o table | grep D4s
# If the size isn't offered there, pick another region or another SKU
# ERROR: "Operation could not be completed as it results in exceeding approved Total Regional Cores quota"
# Fix: Check your quota and request an increase
az vm list-usage --location eastus -o table | grep -i "total regional"
# Go to Azure Portal → Subscriptions → Usage + Quotas → Request increase
# ERROR: "SubnetIsFull" or "InsufficientSubnetSize"
# Fix: Your subnet doesn't have enough IPs for nodes + pods (Azure CNI)
# Azure CNI needs: (max_pods_per_node + 1) × node_count IPs
# For 3 nodes with 110 max pods: (110 + 1) × 3 = 333 IPs → /23 subnet minimum
# Resize subnet or use a larger one
# ERROR: "LinkedAuthorizationFailed"
# Fix: Your managed identity doesn't have permissions on the VNet/subnet
# Grant "Network Contributor" role on the subnet:
az role assignment create \
--assignee $(az aks show -g rg-aks-prod -n prod-cluster --query identity.principalId -o tsv) \
--role "Network Contributor" \
--scope "/subscriptions/{sub}/resourceGroups/rg-network/providers/Microsoft.Network/virtualNetworks/prod-vnet/subnets/aks-subnet"
Scenario 2: "Cluster created but kubectl can't connect"
# Step 1: Make sure you fetched credentials for the right cluster
az aks get-credentials --resource-group rg-aks-dev --name dev-cluster --overwrite-existing
# Step 2: Check the kubeconfig is valid
kubectl config view --minify --raw
# Step 3: If using Azure AD auth and getting "Interactive login required"
# You need to use the Azure AD credential flow:
az aks get-credentials --resource-group rg-aks-dev --name dev-cluster --format exec
# Then run any kubectl command — it will open a browser for Azure AD login
# Step 4: If using authorized IP ranges
MY_IP=$(curl -s https://ifconfig.me)
echo "My IP: $MY_IP"
az aks show -g rg-aks-dev -n dev-cluster --query apiServerAccessProfile.authorizedIpRanges
# If your IP isn't listed:
az aks update -g rg-aks-dev -n dev-cluster --api-server-authorized-ip-ranges "$MY_IP/32"
# Step 5: Check if the cluster is stopped
az aks show -g rg-aks-dev -n dev-cluster --query powerState.code -o tsv
# If "Stopped":
az aks start -g rg-aks-dev -n dev-cluster
Scenario 3: "Cluster upgrade is stuck or failed"
# Step 1: Check the cluster provisioning state
az aks show -g rg-aks-dev -n dev-cluster --query provisioningState -o tsv
# Possible states: Succeeded, Upgrading, Failed, Canceled
# Step 2: If "Upgrading" for too long, check node pool upgrade status
az aks nodepool show -g rg-aks-dev --cluster-name dev-cluster -n agentpool \
--query "{provisioningState:provisioningState, currentOrchestratorVersion:currentOrchestratorVersion}"
# Step 3: Check for PDB (PodDisruptionBudget) blocking the drain
kubectl get pdb -A
# If a PDB prevents eviction, the node drain hangs, blocking the upgrade
# Temporarily relax the PDB or fix the app to handle disruptions
# Step 4: Check for pods without controllers (bare pods can't be evicted)
# (field selectors can't filter on ownerReferences, so use jq instead)
kubectl get pods -A -o json | jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.metadata.name)"'
# Step 5: If the upgrade failed, check the activity log
az monitor activity-log list --resource-group rg-aks-dev --status Failed --offset 2h -o table
# Step 6: For a failed upgrade, retry or force
az aks upgrade -g rg-aks-dev -n dev-cluster --kubernetes-version 1.30.0 --yes
🎯 Interview Questions
Beginner
az aks create --resource-group myRG --name myCluster --generate-ssh-keys. This uses all defaults: 3 nodes, Standard_DS2_v2 VMs, latest stable K8s version, kubenet networking, system-assigned managed identity, free tier. In practice, you should always specify --node-vm-size, --node-count, and --kubernetes-version explicitly for reproducibility.
Run az aks get-credentials --resource-group myRG --name myCluster. This downloads the cluster's kubeconfig and merges it into ~/.kube/config, setting the current context to the new cluster. You can then run kubectl get nodes to verify the connection. Use --overwrite-existing if you need to refresh stale credentials.
kubenet: Nodes get VNet IPs, pods get IPs from a separate CIDR range (not VNet-routable). Simpler setup, fewer IP addresses consumed, but pods can't be reached directly from other VNet resources. Azure CNI: Every pod gets an IP from the VNet subnet. Pods are directly routable, can communicate with other VNet resources using NSGs and UDRs. Requires more IP address planning. Use Azure CNI for production; kubenet for simple dev clusters.
1) Check available versions: az aks get-upgrades -g myRG -n myCluster. 2) Upgrade: az aks upgrade -g myRG -n myCluster --kubernetes-version 1.29.0. The process upgrades the control plane first, then performs a rolling upgrade of node pools (cordon → drain → reimage → uncordon, one node at a time). You can only upgrade one minor version at a time (1.27 → 1.28, not 1.27 → 1.29).
Several strategies: 1) az aks stop / az aks start — stop the cluster outside business hours (saves ~60% on 24/7 pricing). 2) Use smaller VM sizes (Standard_B2s for dev). 3) Use a single node (--node-count 1). 4) Use the free tier (--tier free). 5) Enable spot node pools for non-critical workloads. 6) Automate stop/start with Azure Automation runbooks on a schedule.
Intermediate
Channels: none (manual only), patch (auto-apply patch versions, e.g., 1.28.3 → 1.28.5), stable (auto-upgrade to latest N-1 minor version), rapid (auto-upgrade to latest), node-image (auto-update node OS images without K8s version change). For production: node-image is safest — it keeps OS patches current without changing K8s versions. Combine with Planned Maintenance Windows to control when upgrades happen. Never use rapid in production.
Each az aks get-credentials adds a context to ~/.kube/config. Use: kubectl config get-contexts to list all, kubectl config use-context <name> to switch, kubectl config current-context to verify. For safety, use tools like kubectx for quick switching and kubie for isolated sessions. You can also use --file flag with az aks get-credentials to write to separate files and set KUBECONFIG env var per terminal.
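The separate-files approach mentioned above can be sketched as follows. The `~/.kube/clusters/` layout and the `build_kubeconfig` helper are hypothetical conventions, not anything az or kubectl provides:

```shell
# Keep one kubeconfig file per cluster and join them into a colon-separated
# KUBECONFIG so kubectl sees every context.
build_kubeconfig() {
  local dir="$1" joined="" f
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue                 # skip if the glob matched nothing
    joined="${joined:+$joined:}$f"          # append with ":" separator
  done
  echo "$joined"
}

# Usage (sketch):
#   az aks get-credentials -g rg-aks-dev -n dev-cluster --file ~/.kube/clusters/dev.yaml
#   export KUBECONFIG="$(build_kubeconfig ~/.kube/clusters)"
#   kubectl config get-contexts   # shows contexts from every file
```

Keeping prod credentials in their own file also means you can leave them out of `KUBECONFIG` entirely for day-to-day work.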
Service principals have client secrets that expire (typically after 1-2 years). When they expire, the cluster breaks — it can't create load balancers, scale nodes, or pull images. You have to manually rotate the secret. Managed identities have no secrets to manage — Azure handles credential rotation transparently. They're also more secure (no secret leakage risk) and simpler to set up. Microsoft recommends managed identity for all new AKS clusters.
Key add-ons: monitoring (Container Insights — metrics, logs, alerts via Azure Monitor — essential for all clusters), azure-policy (enforce governance policies — essential for production), azure-keyvault-secrets-provider (mount Key Vault secrets as volumes — recommended over K8s secrets), ingress-appgw (Application Gateway Ingress Controller). Enable add-ons at create time or later: az aks enable-addons --addons monitoring -g myRG -n myCluster.
Maintenance configurations let you control when Azure applies non-urgent updates. You create a schedule: az aks maintenanceconfiguration add --name default --weekday Saturday --start-hour 2. Azure will attempt to apply updates during that window only. Two configuration types: default (for cluster auto-upgrades and node image upgrades) and aksManagedNodeOSUpgradeSchedule (for node OS security updates). Critical security patches may be applied outside your window, but routine updates respect it.
Scenario-Based
VM size: Standard_D4s_v5 or D8s_v5 (4-8 vCPU, 16-32 GB RAM) for API/web workloads. Node count: Minimum 3 across availability zones (1,2,3). Network: Azure CNI with a BYO VNet — plan for enough IPs (at least /21 subnet for 2000+ pod IPs). SLA: Standard tier for 99.95% API server uptime. Autoscaler: Enable with min=3, max=15. Upgrades: Channel=node-image with Saturday 2AM maintenance window. Add-ons: monitoring + azure-policy + azure-keyvault-secrets-provider. Identity: User-assigned managed identity (reusable across clusters). Security: Authorized IP ranges or private cluster. Enable Defender for Containers.
Azure CNI assigns one VNet IP per pod. For each node: 110 pods + 1 node IP = 111 IPs. For 5 nodes: 111 × 5 = 555 IPs needed. A /24 subnet has only 251 usable IPs (256 minus 5 Azure-reserved). Solution: 1) Use a larger subnet (/22 gives 1019 usable IPs). 2) Reduce max-pods (30 pods × 5 nodes = 155 IPs, fits in /24). 3) Switch to Azure CNI Overlay (pods get IPs from an overlay, not the VNet). 4) Use kubenet (only nodes consume VNet IPs). The right choice depends on whether pods need direct VNet routable IPs.
Prevention: 1) Use kubectl config rename-context to name contexts clearly: dev-eastus, PROD-EASTUS. 2) Use kubie or separate KUBECONFIG files per environment. 3) Implement Kubernetes RBAC — dev team gets read-only ClusterRole on prod, full access on dev. 4) Use Azure AD with Conditional Access — require MFA for prod kubeconfig. 5) Set up admission webhooks (OPA/Gatekeeper) to deny deletes on prod namespace without a specific annotation. 6) Use a kubectl wrapper script that shows a warning banner when the context contains "prod".
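Strategy 6 above (a warning wrapper) can be sketched like this. `kguard` is a hypothetical wrapper name and the destructive-verb list is illustrative, not exhaustive:

```shell
# kguard: ask for confirmation before running destructive kubectl verbs
# against any context whose name contains "prod".
kguard() {
  local ctx verb="$1" answer
  ctx="$(kubectl config current-context 2>/dev/null)"
  case "$verb" in
    delete|drain|cordon)
      case "$ctx" in
        *prod*)
          printf '!!! %s on PROD context %s -- type yes to continue: ' "$verb" "$ctx"
          read -r answer
          [ "$answer" = "yes" ] || { echo "aborted"; return 1; } ;;
      esac ;;
  esac
  kubectl "$@"
}

# Usage (sketch):  kguard delete deployment api   # prompts if context matches *prod*
```

Aliasing `kubectl` to the wrapper in your shell profile makes the guard hard to forget, while non-destructive verbs pass through untouched.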
Immediate fix: Reset the service principal credentials: az aks update-credentials --resource-group myRG --name myCluster --reset-service-principal --service-principal $APP_ID --client-secret $NEW_SECRET. Long-term fix: Migrate to managed identity: az aks update --resource-group myRG --name myCluster --enable-managed-identity. This is a one-time, non-disruptive operation. Managed identities have no secrets, no expiration, and are auto-rotated by Azure. This is the recommended identity model for all AKS clusters going forward.
Use Bicep or Terraform (depending on team preference). Create a module/template for the AKS cluster with parameterized values (node count, VM size, tier, K8s version, VNet). Use environment-specific parameter files: dev.tfvars, staging.tfvars, prod.tfvars. Store in Git and deploy via CI/CD pipeline (Azure DevOps or GitHub Actions). Key: version-lock the K8s version in IaC to prevent drift. Use Terraform state locking (Azure Storage backend) to prevent concurrent modifications. Test plan output in CI before applying.
🌍 Real-World Use Case
A SaaS company with 3 environments (dev, staging, prod) standardized their AKS cluster creation:
- Infrastructure as Code: All clusters defined in Terraform with a shared module. Environment differences (VM size, node count, features) are parameterized. Pull requests for infrastructure changes go through the same review process as application code.
- Dev clusters: 2 nodes, Standard_B2s, free tier, auto-stop at 7 PM via Azure Automation. Monthly cost: ~$60.
- Staging: 3 nodes, Standard_D2s_v5, free tier, mirrors prod configuration at smaller scale. Monthly cost: ~$200.
- Production: 3-10 nodes (autoscaler), Standard_D4s_v5, standard tier, availability zones, private cluster, Container Insights + Defender. Monthly cost: ~$800-2500 depending on load.
- Result: New environment provisioning dropped from 2 days (manual) to 15 minutes (terraform apply). Zero configuration drift between environments.
📝 Summary
- `az aks create` is the core command — know the critical flags: `--node-count`, `--node-vm-size`, `--network-plugin`, `--kubernetes-version`, `--zones`
- Always use managed identity (`--enable-managed-identity`) over service principals
- `az aks get-credentials` configures kubectl — always verify your context before running commands
- Upgrades are in-place: control plane first, then rolling node pool updates
- `az aks stop`/`az aks start` saves money on dev clusters — automate with Azure Automation
- Production clusters need: availability zones, Standard SLA tier, autoscaler, monitoring, and BYO VNet