If you work with multiple AKS clusters, each get-credentials call adds a new context to your kubeconfig. Always check which cluster you're targeting with kubectl config current-context before running commands — especially destructive ones like delete or drain.
Cluster Creation & Configuration
Create AKS clusters from the CLI, understand every critical flag, connect kubectl, manage upgrades, and optimize costs with start/stop.
🧒 Simple Explanation (ELI5)
Creating an AKS cluster is like ordering a custom computer online:
- You pick the CPU and RAM (VM size for nodes)
- You choose how many units (node count)
- You select the operating system (Ubuntu, AzureLinux)
- You decide on networking options (like choosing WiFi vs Ethernet)
- Azure builds and ships it in about 5-10 minutes
Once it arrives, you plug in your keyboard (kubectl) using the delivery instructions (kubeconfig) and start working.
🔧 Technical Explanation
az aks create — The Master Command
The az aks create command has dozens of flags. Here are the ones that matter for every cluster:
| Flag | What It Does | Recommendation |
|---|---|---|
| `--resource-group` | Resource group to create the cluster in | Use a dedicated RG per cluster or environment |
| `--name` | Cluster name (must be unique in the RG) | Convention: `{env}-{app}-aks` (e.g., `prod-api-aks`) |
| `--node-count` | Number of nodes in the default node pool | Dev: 1-2, Staging: 2-3, Prod: 3+ |
| `--node-vm-size` | Azure VM SKU for worker nodes | Dev: `Standard_B2s`, Prod: `Standard_D4s_v5` |
| `--kubernetes-version` | K8s version to install | Use N-1 (one behind latest) for stability |
| `--network-plugin` | Networking model: `azure` (CNI) or `kubenet` | `azure` for production (pods get VNet IPs) |
| `--enable-managed-identity` | Use managed identity instead of a service principal | Always use this; Microsoft recommends managed identity over service principals |
| `--generate-ssh-keys` | Auto-generate SSH keys for node access | Use for dev; provide your own keys for prod |
| `--enable-addons` | Enable AKS add-ons (monitoring, azure-policy, etc.) | Enable monitoring for all clusters |
| `--zones` | Distribute nodes across availability zones | Always use 1 2 3 for production |
| `--tier` | Cluster SKU tier: `free` or `standard` | `standard` for production (SLA-backed) |
| `--max-pods` | Max pods per node (default: 110 for kubenet, 30 for Azure CNI) | 110 is a safe production default |
Production vs Development Configurations
| Setting | Development | Production |
|---|---|---|
| Node count | 1-2 | 3+ (across availability zones) |
| VM size | Standard_B2s ($30/mo) | Standard_D4s_v5 ($140/mo) |
| K8s version | Latest (experiment) | N-1 (stability) |
| Network plugin | kubenet (simpler) | Azure CNI (VNet integration) |
| SLA tier | Free | Standard (99.95% SLA) |
| Availability zones | Not needed | Zones 1, 2, 3 |
| Monitoring | Optional | Container Insights + Defender |
| API server access | Public | Private cluster or authorized IP ranges |
| Auto-upgrade | patch | node-image (safer) |
kubeconfig Management
After creating a cluster, you need to configure kubectl to connect to it. The az aks get-credentials command merges the cluster's connection info into your local kubeconfig file (~/.kube/config).
Cluster Upgrades
AKS supports in-place Kubernetes version upgrades. The process:
- Control plane upgrades first — Azure updates the managed API server, etcd, scheduler, and controllers.
- Node pools upgrade second — Nodes are cordoned, drained, reimaged with the new K8s version, and uncordoned. This is a rolling update (one node at a time by default).
- You cannot skip minor versions — upgrades must proceed one minor version at a time (e.g., 1.27 → 1.28 is fine; 1.27 → 1.29 requires going through 1.28 first).
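Because minor versions cannot be skipped, a multi-version upgrade is a sequence of hops. The hops can be sketched with a small shell helper (`upgrade_path` is a hypothetical planning aid, not an az feature; it handles minor versions only, not patch suffixes):

```shell
# upgrade_path: print each minor-version hop needed between two K8s versions.
# Hypothetical helper for planning only; AKS itself enforces one minor at a time.
upgrade_path() {
  local from_minor="${1#*.}" to_minor="${2#*.}"   # "1.27" -> "27"
  local major="${1%%.*}" m
  for ((m = from_minor + 1; m <= to_minor; m++)); do
    echo "${major}.${m}"
  done
}

upgrade_path 1.27 1.30   # prints 1.28, 1.29, 1.30: three sequential upgrades
```

Each printed version would be one `az aks upgrade --kubernetes-version <version>` invocation, control plane first as described above.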
Start/Stop for Cost Savings
AKS supports stopping (deallocating) a cluster entirely. This stops all worker nodes and stops billing for compute. The cluster metadata is preserved — when you start it again, everything comes back as it was.
A stopped cluster: loses all running pods, loses ephemeral storage (emptyDir), retains persistent volumes (Azure Disks), retains cluster configuration. Never use this for production. For production cost savings, use cluster autoscaler instead.
⌨️ Hands-on
Create a Development Cluster
# Step 1: Create a resource group
az group create --name rg-aks-dev --location eastus
# Step 2: Create a minimal dev cluster
az aks create \
--resource-group rg-aks-dev \
--name dev-cluster \
--node-count 2 \
--node-vm-size Standard_B2s \
--kubernetes-version 1.29.2 \
--network-plugin kubenet \
--enable-managed-identity \
--generate-ssh-keys \
--enable-addons monitoring \
--tier free \
--tags Environment=dev Team=engineering
# This takes ~5-8 minutes. Output includes the cluster's full JSON configuration.
Create a Production Cluster
# Production cluster with all the bells and whistles
az aks create \
--resource-group rg-aks-prod \
--name prod-cluster \
--node-count 3 \
--node-vm-size Standard_D4s_v5 \
--kubernetes-version 1.28.5 \
--network-plugin azure \
--vnet-subnet-id "/subscriptions/{sub}/resourceGroups/rg-network/providers/Microsoft.Network/virtualNetworks/prod-vnet/subnets/aks-subnet" \
--enable-managed-identity \
--enable-addons monitoring,azure-policy \
--zones 1 2 3 \
--tier standard \
--max-pods 110 \
--enable-cluster-autoscaler \
--min-count 3 \
--max-count 10 \
--auto-upgrade-channel node-image \
--tags Environment=production Team=platform CostCenter=CC-1234
# Key differences from dev:
# - Azure CNI with BYO VNet (--network-plugin azure + --vnet-subnet-id)
# - Availability zones (--zones 1 2 3)
# - Standard SLA tier (--tier standard)
# - Cluster autoscaler enabled
# - Auto-upgrade for node images
# - Azure Policy add-on for governance
Connect and Verify
# Download credentials and set kubectl context
az aks get-credentials --resource-group rg-aks-dev --name dev-cluster
# Verify the connection
kubectl cluster-info
# Example output:
# Kubernetes control plane is running at https://dev-cluster-rg-aks-dev-abc123.hcp.eastus.azmk8s.io:443
# CoreDNS is running at https://...
# Check nodes are ready
kubectl get nodes -o wide
# Check your current context (which cluster kubectl is talking to)
kubectl config current-context
# Output: dev-cluster
# List all contexts (if you have multiple clusters)
kubectl config get-contexts
# Switch between clusters
kubectl config use-context prod-cluster
Cluster Upgrades
# Check available upgrades for your cluster
az aks get-upgrades --resource-group rg-aks-dev --name dev-cluster -o table
# Example output:
# Name ResourceGroup MasterVersion Upgrades
# ------- --------------- --------------- --------
# default rg-aks-dev 1.29.2 1.30.0
# Upgrade the cluster (control plane + all node pools)
az aks upgrade \
--resource-group rg-aks-dev \
--name dev-cluster \
--kubernetes-version 1.30.0
# Upgrade only the control plane (node pools stay on old version)
az aks upgrade \
--resource-group rg-aks-dev \
--name dev-cluster \
--kubernetes-version 1.30.0 \
--control-plane-only
# Upgrade a specific node pool
az aks nodepool upgrade \
--resource-group rg-aks-dev \
--cluster-name dev-cluster \
--name agentpool \
--kubernetes-version 1.30.0
# Check upgrade status
az aks show --resource-group rg-aks-dev --name dev-cluster \
--query "{version:kubernetesVersion, provisioningState:provisioningState}"
Start/Stop Cluster (Dev Cost Savings)
# Stop a dev cluster at end of day — stops all billing for node VMs
az aks stop --resource-group rg-aks-dev --name dev-cluster
# Check the cluster state
az aks show --resource-group rg-aks-dev --name dev-cluster --query powerState
# Output: { "code": "Stopped" }
# Start the cluster next morning
az aks start --resource-group rg-aks-dev --name dev-cluster
# Verify nodes are back
kubectl get nodes
# Nodes may take 1-3 minutes to become Ready after start
# Pro tip: Automate with Azure Automation
# Create a runbook that runs "az aks stop" at 7 PM and "az aks start" at 8 AM
# Saves ~60% on dev cluster compute costs (13 hours stopped per weekday)
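The stop/start automation above can be sketched as the script body such a runbook would execute. The `run`/`DRY_RUN` guard is an illustration (not an az feature) so the logic can be exercised without touching a real cluster; the resource group and cluster name reuse the dev values from this page:

```shell
# Sketch of the evening/morning toggle an Azure Automation runbook could run.
# DRY_RUN=1 (the default here) prints the az commands instead of executing them.
RG="rg-aks-dev"; CLUSTER="dev-cluster"; DRY_RUN="${DRY_RUN:-1}"

run() {
  # Execute the given command, or just print it in dry-run mode.
  if [ "$DRY_RUN" = "1" ]; then echo "WOULD RUN: $*"; else "$@"; fi
}

toggle_cluster() {
  case "$1" in
    stop)  run az aks stop  --resource-group "$RG" --name "$CLUSTER" ;;
    start) run az aks start --resource-group "$RG" --name "$CLUSTER" ;;
  esac
}

toggle_cluster stop   # schedule at 19:00; schedule "start" at 08:00
```

In a real runbook you would drop the dry-run guard and let the schedule trigger `toggle_cluster stop` and `toggle_cluster start` directly.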
Useful Inspection Commands
# View the full cluster configuration
az aks show --resource-group rg-aks-dev --name dev-cluster -o json | jq '{
name: .name,
version: .kubernetesVersion,
tier: .sku.tier,
nodeCount: .agentPoolProfiles[0].count,
vmSize: .agentPoolProfiles[0].vmSize,
networkPlugin: .networkProfile.networkPlugin,
autoUpgrade: .autoUpgradeProfile.upgradeChannel,
addons: [.addonProfiles | to_entries[] | select(.value.enabled) | .key]
}'
# Check what add-ons are enabled
az aks addon list --resource-group rg-aks-dev --name dev-cluster -o table
# View cluster auto-upgrade settings
az aks show --resource-group rg-aks-dev --name dev-cluster \
--query autoUpgradeProfile -o json
# Check maintenance window configuration
az aks maintenanceconfiguration list \
--resource-group rg-aks-dev --cluster-name dev-cluster -o table
🐛 Debugging Scenarios
Scenario 1: "Cluster creation failed"
# Common errors and fixes:
# ERROR: "The VM size 'Standard_D4s_v5' is not available in location 'westus'"
# Fix: Check available VM sizes in your region
az vm list-sizes --location westus -o table | grep D4s
# If the size isn't offered there, pick another region or another SKU
# ERROR: "Operation could not be completed as it results in exceeding approved Total Regional Cores quota"
# Fix: Check your quota and request an increase
az vm list-usage --location eastus -o table | grep -i "total regional"
# Go to Azure Portal → Subscriptions → Usage + Quotas → Request increase
# ERROR: "SubnetIsFull" or "InsufficientSubnetSize"
# Fix: Your subnet doesn't have enough IPs for nodes + pods (Azure CNI)
# Azure CNI needs: (max_pods_per_node + 1) × node_count IPs
# For 3 nodes with 110 max pods: (110 + 1) × 3 = 333 IPs → /23 subnet minimum
# Resize subnet or use a larger one
# ERROR: "LinkedAuthorizationFailed"
# Fix: Your managed identity doesn't have permissions on the VNet/subnet
# Grant "Network Contributor" role on the subnet:
az role assignment create \
--assignee $(az aks show -g rg-aks-prod -n prod-cluster --query identity.principalId -o tsv) \
--role "Network Contributor" \
--scope "/subscriptions/{sub}/resourceGroups/rg-network/providers/Microsoft.Network/virtualNetworks/prod-vnet/subnets/aks-subnet"
Scenario 2: "Cluster created but kubectl can't connect"
# Step 1: Make sure you fetched credentials for the right cluster
az aks get-credentials --resource-group rg-aks-dev --name dev-cluster --overwrite-existing
# Step 2: Check the kubeconfig is valid
kubectl config view --minify --raw
# Step 3: If using Azure AD auth and getting "Interactive login required"
# You need to use the Azure AD credential flow:
az aks get-credentials --resource-group rg-aks-dev --name dev-cluster --format exec
# Then run any kubectl command — it will open a browser for Azure AD login
# Step 4: If using authorized IP ranges
MY_IP=$(curl -s https://ifconfig.me)
echo "My IP: $MY_IP"
az aks show -g rg-aks-dev -n dev-cluster --query apiServerAccessProfile.authorizedIpRanges
# If your IP isn't listed:
az aks update -g rg-aks-dev -n dev-cluster --api-server-authorized-ip-ranges "$MY_IP/32"
# Step 5: Check if the cluster is stopped
az aks show -g rg-aks-dev -n dev-cluster --query powerState.code -o tsv
# If "Stopped":
az aks start -g rg-aks-dev -n dev-cluster
Scenario 3: "Cluster upgrade is stuck or failed"
# Step 1: Check the cluster provisioning state
az aks show -g rg-aks-dev -n dev-cluster --query provisioningState -o tsv
# Possible states: Succeeded, Upgrading, Failed, Canceled
# Step 2: If "Upgrading" for too long, check node pool upgrade status
az aks nodepool show -g rg-aks-dev --cluster-name dev-cluster -n agentpool \
--query "{provisioningState:provisioningState, currentOrchestratorVersion:currentOrchestratorVersion}"
# Step 3: Check for PDB (PodDisruptionBudget) blocking the drain
kubectl get pdb -A
# If a PDB prevents eviction, the node drain hangs, blocking the upgrade
# Temporarily relax the PDB or fix the app to handle disruptions
# Step 4: Check for pods without controllers (bare pods can't be evicted)
# (field selectors can't filter on ownerReferences, so use jq instead)
kubectl get pods -A -o json | jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.metadata.name)"'
# Step 5: If the upgrade failed, check the activity log
az monitor activity-log list --resource-group rg-aks-dev --status Failed --offset 2h -o table
# Step 6: For a failed upgrade, retry or force
az aks upgrade -g rg-aks-dev -n dev-cluster --kubernetes-version 1.30.0 --yes
🎯 Interview Questions
Beginner
az aks create --resource-group myRG --name myCluster --generate-ssh-keys. This uses all defaults: 3 nodes, Standard_DS2_v2 VMs, latest stable K8s version, kubenet networking, system-assigned managed identity, free tier. In practice, you should always specify --node-vm-size, --node-count, and --kubernetes-version explicitly for reproducibility.
Run az aks get-credentials --resource-group myRG --name myCluster. This downloads the cluster's kubeconfig and merges it into ~/.kube/config, setting the current context to the new cluster. You can then run kubectl get nodes to verify the connection. Use --overwrite-existing if you need to refresh stale credentials.
kubenet: Nodes get VNet IPs, pods get IPs from a separate CIDR range (not VNet-routable). Simpler setup, fewer IP addresses consumed, but pods can't be reached directly from other VNet resources. Azure CNI: Every pod gets an IP from the VNet subnet. Pods are directly routable, can communicate with other VNet resources using NSGs and UDRs. Requires more IP address planning. Use Azure CNI for production; kubenet for simple dev clusters.
1) Check available versions: az aks get-upgrades -g myRG -n myCluster. 2) Upgrade: az aks upgrade -g myRG -n myCluster --kubernetes-version 1.29.0. The process upgrades the control plane first, then performs a rolling upgrade of node pools (cordon → drain → reimage → uncordon, one node at a time). You can only upgrade one minor version at a time (1.27 → 1.28, not 1.27 → 1.29).
Several strategies: 1) az aks stop / az aks start — stop the cluster outside business hours (saves ~60% on 24/7 pricing). 2) Use smaller VM sizes (Standard_B2s for dev). 3) Use a single node (--node-count 1). 4) Use the free tier (--tier free). 5) Enable spot node pools for non-critical workloads. 6) Automate stop/start with Azure Automation runbooks on a schedule.
Intermediate
Channels: none (manual only), patch (auto-apply patch versions, e.g., 1.28.3 → 1.28.5), stable (auto-upgrade to latest N-1 minor version), rapid (auto-upgrade to latest), node-image (auto-update node OS images without K8s version change). For production: node-image is safest — it keeps OS patches current without changing K8s versions. Combine with Planned Maintenance Windows to control when upgrades happen. Never use rapid in production.
Each az aks get-credentials adds a context to ~/.kube/config. Use: kubectl config get-contexts to list all, kubectl config use-context <name> to switch, kubectl config current-context to verify. For safety, use tools like kubectx for quick switching and kubie for isolated sessions. You can also use --file flag with az aks get-credentials to write to separate files and set KUBECONFIG env var per terminal.
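The separate-files approach mentioned above can be sketched as follows. The `~/.kube/clusters/` layout and the `build_kubeconfig` helper are hypothetical conventions, not anything az or kubectl provides:

```shell
# Keep one kubeconfig file per cluster and join them into a colon-separated
# KUBECONFIG so kubectl sees every context.
build_kubeconfig() {
  local dir="$1" joined="" f
  for f in "$dir"/*.yaml; do
    [ -e "$f" ] || continue                 # skip if the glob matched nothing
    joined="${joined:+$joined:}$f"          # append with ":" separator
  done
  echo "$joined"
}

# Usage (sketch):
#   az aks get-credentials -g rg-aks-dev -n dev-cluster --file ~/.kube/clusters/dev.yaml
#   export KUBECONFIG="$(build_kubeconfig ~/.kube/clusters)"
#   kubectl config get-contexts   # shows contexts from every file
```

Keeping prod credentials in their own file also means you can leave them out of `KUBECONFIG` entirely for day-to-day work.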
Service principals have client secrets that expire (typically after 1-2 years). When they expire, the cluster breaks — it can't create load balancers, scale nodes, or pull images. You have to manually rotate the secret. Managed identities have no secrets to manage — Azure handles credential rotation transparently. They're also more secure (no secret leakage risk) and simpler to set up. Microsoft recommends managed identity for all new AKS clusters.
Key add-ons: monitoring (Container Insights — metrics, logs, alerts via Azure Monitor — essential for all clusters), azure-policy (enforce governance policies — essential for production), azure-keyvault-secrets-provider (mount Key Vault secrets as volumes — recommended over K8s secrets), ingress-appgw (Application Gateway Ingress Controller). Enable add-ons at create time or later: az aks enable-addons --addons monitoring -g myRG -n myCluster.
Maintenance configurations let you control when Azure applies non-urgent updates. You create a schedule: az aks maintenanceconfiguration add --name default --weekday Saturday --start-hour 2. Azure will attempt to apply updates during that window only. Two configuration types: default (for cluster auto-upgrades and node image upgrades) and aksManagedNodeOSUpgradeSchedule (for node OS security updates). Critical security patches may be applied outside your window, but routine updates respect it.
Scenario-Based
VM size: Standard_D4s_v5 or D8s_v5 (4-8 vCPU, 16-32 GB RAM) for API/web workloads. Node count: Minimum 3 across availability zones (1,2,3). Network: Azure CNI with a BYO VNet — plan for enough IPs (at least /21 subnet for 2000+ pod IPs). SLA: Standard tier for 99.95% API server uptime. Autoscaler: Enable with min=3, max=15. Upgrades: Channel=node-image with Saturday 2AM maintenance window. Add-ons: monitoring + azure-policy + azure-keyvault-secrets-provider. Identity: User-assigned managed identity (reusable across clusters). Security: Authorized IP ranges or private cluster. Enable Defender for Containers.
Azure CNI assigns one VNet IP per pod. For each node: 110 pods + 1 node IP = 111 IPs. For 5 nodes: 111 × 5 = 555 IPs needed. A /24 subnet has only 251 usable IPs (256 minus 5 Azure-reserved). Solution: 1) Use a larger subnet (/22 gives 1019 usable IPs). 2) Reduce max-pods (30 pods × 5 nodes = 155 IPs, fits in /24). 3) Switch to Azure CNI Overlay (pods get IPs from an overlay, not the VNet). 4) Use kubenet (only nodes consume VNet IPs). The right choice depends on whether pods need direct VNet routable IPs.
Prevention: 1) Use kubectl config rename-context to name contexts clearly: dev-eastus, PROD-EASTUS. 2) Use kubie or separate KUBECONFIG files per environment. 3) Implement Kubernetes RBAC — dev team gets read-only ClusterRole on prod, full access on dev. 4) Use Azure AD with Conditional Access — require MFA for prod kubeconfig. 5) Set up admission webhooks (OPA/Gatekeeper) to deny deletes on prod namespace without a specific annotation. 6) Use a kubectl wrapper script that shows a warning banner when the context contains "prod".
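Strategy 6 above (a warning wrapper) can be sketched like this. `kguard` is a hypothetical wrapper name and the destructive-verb list is illustrative, not exhaustive:

```shell
# kguard: ask for confirmation before running destructive kubectl verbs
# against any context whose name contains "prod".
kguard() {
  local ctx verb="$1" answer
  ctx="$(kubectl config current-context 2>/dev/null)"
  case "$verb" in
    delete|drain|cordon)
      case "$ctx" in
        *prod*)
          printf '!!! %s on PROD context %s -- type yes to continue: ' "$verb" "$ctx"
          read -r answer
          [ "$answer" = "yes" ] || { echo "aborted"; return 1; } ;;
      esac ;;
  esac
  kubectl "$@"
}

# Usage (sketch):  kguard delete deployment api   # prompts if context matches *prod*
```

Aliasing `kubectl` to the wrapper in your shell profile makes the guard hard to forget, while non-destructive verbs pass through untouched.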
Immediate fix: Reset the service principal credentials: az aks update-credentials --resource-group myRG --name myCluster --reset-service-principal --service-principal $APP_ID --client-secret $NEW_SECRET. Long-term fix: Migrate to managed identity: az aks update --resource-group myRG --name myCluster --enable-managed-identity. This is a one-time, non-disruptive operation. Managed identities have no secrets, no expiration, and are auto-rotated by Azure. This is the recommended identity model for all AKS clusters going forward.
Use Bicep or Terraform (depending on team preference). Create a module/template for the AKS cluster with parameterized values (node count, VM size, tier, K8s version, VNet). Use environment-specific parameter files: dev.tfvars, staging.tfvars, prod.tfvars. Store in Git and deploy via CI/CD pipeline (Azure DevOps or GitHub Actions). Key: version-lock the K8s version in IaC to prevent drift. Use Terraform state locking (Azure Storage backend) to prevent concurrent modifications. Test plan output in CI before applying.
🌍 Real-World Use Case
A SaaS company with 3 environments (dev, staging, prod) standardized their AKS cluster creation:
- Infrastructure as Code: All clusters defined in Terraform with a shared module. Environment differences (VM size, node count, features) are parameterized. Pull requests for infrastructure changes go through the same review process as application code.
- Dev clusters: 2 nodes, Standard_B2s, free tier, auto-stop at 7 PM via Azure Automation. Monthly cost: ~$60.
- Staging: 3 nodes, Standard_D2s_v5, free tier, mirrors prod configuration at smaller scale. Monthly cost: ~$200.
- Production: 3-10 nodes (autoscaler), Standard_D4s_v5, standard tier, availability zones, private cluster, Container Insights + Defender. Monthly cost: ~$800-2500 depending on load.
- Result: New environment provisioning dropped from 2 days (manual) to 15 minutes (terraform apply). Zero configuration drift between environments.
📝 Summary
- `az aks create` is the core command — know the critical flags: `--node-count`, `--node-vm-size`, `--network-plugin`, `--kubernetes-version`, `--zones`
- Always use managed identity (`--enable-managed-identity`) over service principals
- `az aks get-credentials` configures kubectl — always verify your context before running commands
- Upgrades are in-place: control plane first, then rolling node pool updates
- `az aks stop`/`az aks start` saves money on dev clusters — automate with Azure Automation
- Production clusters need: availability zones, Standard SLA tier, autoscaler, monitoring, and BYO VNet