AKS Foundation with Bicep
Provision a production-ready AKS cluster — networking, identity, and node pool — as auditable, repeatable Bicep.
Simple Explanation (ELI5)
AKS is a managed Kubernetes service, but it still depends on real Azure infrastructure: a virtual network whose subnet supplies IPs for nodes and (with Azure CNI) pods, a managed identity so the cluster can create load balancers and disks on your behalf, and a node pool that defines which VM sizes run your workloads. Doing this through the Azure portal means clicking through about 25 screens with no history or repeatability. One Bicep file captures all of that as code, so every environment gets the same cluster topology every time.
Why Do We Need It?
- AKS clusters have 20+ interdependent settings. A single misconfiguration (wrong subnet size, missing identity role) can produce a cluster that deploys cleanly and then fails weeks later.
- Bicep makes every setting auditable through code review before it reaches production.
- Parameterisation lets the same template create a small dev cluster and a production cluster with autoscaling, different node sizes, and availability zones.
- The template becomes the runbook: any engineer can recreate the cluster from scratch in under 10 minutes.
AKS Dependency Chain
Technical Explanation
AKS requires the following resources in order:
- VNet and subnet: the subnet must be large enough for nodes, pods (Azure CNI), and internal load balancers. A /24 handles small clusters; production should use /22 or larger.
- User-assigned managed identity (UAI): the cluster control plane acts as this identity. It needs Network Contributor on the subnet to create internal load balancers. Image pulls use the separate kubelet identity, which needs AcrPull on any attached container registry. Never use a service principal; its secret expires and must be rotated by hand, whereas managed identity credentials are rotated by Azure.
- AKS managed cluster: references the subnet ID and UAI resource ID as dependencies. Bicep infers the deployment order from these symbolic references, so no explicit dependsOn is needed.
Key AKS Properties Table
| Property | What it controls | Prod guidance |
|---|---|---|
| agentPoolProfiles.vmSize | Node VM size | Standard_D4s_v3 or larger |
| agentPoolProfiles.count | Initial node count | 3 for HA (one per availability zone) |
| agentPoolProfiles.enableAutoScaling | Cluster autoscaler | true in prod; set minCount and maxCount |
| networkProfile.networkPlugin | kubenet vs Azure CNI | azure for pod-level network policies |
| oidcIssuerProfile.enabled | Workload Identity federation | true — required for Workload Identity |
| identity.type | Cluster identity model | UserAssigned — explicit and auditable |
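As a rough sketch of the prod guidance in the table, the system pool entry might look like the fragment below. The zone list and the minCount/maxCount values are assumptions for illustration, zone support depends on the region and VM size, and aksSubnetId refers to the variable defined in the full template in this section.

```bicep
// Fragment: belongs under `properties` of the managedClusters resource.
agentPoolProfiles: [
  {
    name: 'system'
    mode: 'System'
    osType: 'Linux'
    vmSize: 'Standard_D4s_v3'
    vnetSubnetID: aksSubnetId          // variable from the full template
    availabilityZones: ['1', '2', '3'] // assumed: the region offers three zones
    enableAutoScaling: true
    count: 3    // initial node count, one per zone
    minCount: 3 // assumed autoscaler floor
    maxCount: 6 // assumed autoscaler ceiling
  }
]
```

With enableAutoScaling set to true, count only sets the starting size; the autoscaler then keeps the pool between minCount and maxCount.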
Full Bicep: VNet, Identity, and AKS Cluster
@description('Environment name used in resource naming')
@allowed(['dev', 'stage', 'prod'])
param environment string = 'dev'
param location string = resourceGroup().location
@description('Number of nodes in the system node pool')
@minValue(1)
@maxValue(100)
param nodeCount int = 2
@description('VM size for cluster nodes')
param nodeVmSize string = 'Standard_D2s_v3'
// ── Networking ─────────────────────────────────────────────────────────────
resource vnet 'Microsoft.Network/virtualNetworks@2023-09-01' = {
  name: 'vnet-aks-${environment}'
  location: location
  properties: {
    addressSpace: {
      addressPrefixes: ['10.1.0.0/16']
    }
    subnets: [
      {
        name: 'snet-aks-nodes'
        properties: {
          // 1,019 usable IPs after Azure's 5 reserved addresses.
          // With Azure CNI at 30 pods per node (31 IPs per node) that fits about 32 nodes.
          addressPrefix: '10.1.0.0/22'
        }
      }
    ]
  }
}
var aksSubnetId = vnet.properties.subnets[0].id
// ── Managed Identity ────────────────────────────────────────────────────────
resource aksIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2023-01-31' = {
  name: 'id-aks-${environment}'
  location: location
}
// ── AKS Cluster ─────────────────────────────────────────────────────────────
resource aks 'Microsoft.ContainerService/managedClusters@2024-02-01' = {
  name: 'aks-platform-${environment}'
  location: location
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${aksIdentity.id}': {}
    }
  }
  properties: {
    dnsPrefix: 'aksplatform${environment}'
    agentPoolProfiles: [
      {
        name: 'system'
        count: nodeCount
        vmSize: nodeVmSize
        mode: 'System'
        vnetSubnetID: aksSubnetId
        enableAutoScaling: false
        osType: 'Linux'
      }
    ]
    networkProfile: {
      networkPlugin: 'azure'
      serviceCidr: '10.2.0.0/24'
      dnsServiceIP: '10.2.0.10'
    }
    oidcIssuerProfile: {
      enabled: true
    }
    securityProfile: {
      workloadIdentity: {
        enabled: true
      }
    }
  }
}
output aksName string = aks.name
output aksId string = aks.id
output kubeletIdentityObjectId string = aks.properties.identityProfile.kubeletidentity.objectId
Hands-on Steps
- Create a dedicated resource group:
  az group create -n rg-aks-dev -l eastus
- Build and validate:
  az bicep build --file main.bicep
  az deployment group validate -g rg-aks-dev --template-file main.bicep
- Run what-if to preview all 3 resources before deploying:
  az deployment group what-if -g rg-aks-dev --template-file main.bicep
- Deploy:
  az deployment group create -g rg-aks-dev --template-file main.bicep --name aks-deploy
- Connect:
  az aks get-credentials -g rg-aks-dev -n aks-platform-dev
  kubectl get nodes
- After the deployment completes, assign the cluster identity Network Contributor on the subnet if internal load balancers are needed.
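The commands above deploy with the template defaults (dev, two Standard_D2s_v3 nodes). One way to target another environment is a Bicep parameters file. The sketch below assumes a file named prod.bicepparam sitting next to main.bicep and an Azure CLI and Bicep version recent enough to support .bicepparam files; the values are illustrative, not recommendations.

```bicep
// prod.bicepparam (hypothetical file name)
// The using declaration ties this file to the template it parameterises.
using './main.bicep'

param environment = 'prod'
param nodeCount = 3
param nodeVmSize = 'Standard_D4s_v3'
```

It would be passed to the deploy step with --parameters prod.bicepparam. Because the file names its template through using, the CLI should be able to resolve main.bicep without a separate --template-file argument, though that is worth confirming against your CLI version.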
Never create AKS clusters with a service principal. Service principal credentials expire. Use a user-assigned managed identity so the cluster credential is managed by Azure and rotated automatically.
Debugging Scenarios
Runbook 1: Deployment Fails with SubnetIsTooSmall
Symptom: ARM error SubnetIsTooSmall: Subnet snet-aks-nodes does not have enough available IP addresses
- Check the networkPlugin: Azure CNI reserves one IP per pod, not just one per node. At the default of 30 pods per node, each node consumes 31 addresses, so a /24 (251 usable IPs after Azure's 5 reserved addresses) supports only about 8 nodes.
- Increase the subnet CIDR to /22 (1,019 usable IPs) for clusters of up to about 32 nodes at that density, and go larger beyond that.
- If the subnet already exists and is too small, you must create a new subnet, update the Bicep, and redeploy. Azure does not allow in-place subnet resizing that would conflict with existing allocations.
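The sizing arithmetic behind this runbook can be worked through as follows. It assumes classic Azure CNI with the default of 30 pods per node, plus the five addresses Azure reserves in every subnet; a different maxPods setting changes the per-node figure.

```latex
\begin{align*}
\text{IPs per node} &= \text{maxPods} + 1 = 30 + 1 = 31 \\
\text{usable}(/24) &= 2^{32-24} - 5 = 251
  &&\Rightarrow \lfloor 251 / 31 \rfloor = 8 \text{ nodes} \\
\text{usable}(/22) &= 2^{32-22} - 5 = 1019
  &&\Rightarrow \lfloor 1019 / 31 \rfloor = 32 \text{ nodes} \\
\text{30-node cluster} &: \; 30 \times 31 = 930 \le 1019
\end{align*}
```

Leave headroom beyond these figures: node upgrades temporarily add surge nodes, and internal load balancers take addresses from the same subnet.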
Runbook 2: Identity Does Not Have Sufficient Permissions
Symptom: The AKS cluster deploys successfully, but Services of type LoadBalancer stay in Pending indefinitely. Azure portal shows reconcileLoadBalancer: failed to ensure load balancer
- Check whether the cluster managed identity has the Network Contributor role on the subnet or resource group.
- Assign it to the cluster identity (not the kubelet identity):
  az role assignment create --assignee <cluster-identity-principal-id> --role "Network Contributor" --scope <subnet-id>
- To prevent this in Bicep, add a roleAssignment resource block after the identity declaration so the role is always deployed with the cluster.
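A minimal sketch of such a role assignment, written to sit in the same file as the template's vnet and aksIdentity resources. The symbolic names aksSubnet and subnetNetworkContributor are made up for this example; the GUID is the built-in Network Contributor role definition ID.

```bicep
// Built-in Network Contributor role definition ID (identical in every tenant).
var networkContributorRoleId = '4d97b98b-1d4f-4787-a291-c67834d212e7'

// Handle to the subnet declared inline on the vnet, so it can be used as a scope.
resource aksSubnet 'Microsoft.Network/virtualNetworks/subnets@2023-09-01' existing = {
  parent: vnet
  name: 'snet-aks-nodes'
}

resource subnetNetworkContributor 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Role assignment names must be GUIDs; guid() keeps this one stable across redeployments.
  name: guid(aksSubnet.id, aksIdentity.id, networkContributorRoleId)
  scope: aksSubnet
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', networkContributorRoleId)
    // The cluster (control plane) identity creates load balancers, so it receives the role.
    principalId: aksIdentity.properties.principalId
    // Tells ARM this is a service identity, which avoids lookup failures right after the identity is created.
    principalType: 'ServicePrincipal'
  }
}
```

Scoping the assignment to the subnet rather than the resource group keeps the grant as narrow as the load balancer use case allows. Depending on your Bicep version, it may be worth adding an explicit dependsOn so the cluster is created only after the assignment exists.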
Runbook 3: Regional Quota Exceeded
Symptom: Deployment fails with QuotaExceeded: Cores quota for Standard DSv3 Family in EastUS has been exceeded
- Run the following to check quota:
  az vm list-usage --location eastus --query "[?contains(name.value,'DSv3')]" -o table
- Request a quota increase in the Azure portal under Subscriptions > Usage + quotas, or change to a region with available quota.
- For non-production use, switch nodeVmSize to Standard_B2s, which consumes burstable quota instead.
Interview Questions
Beginner
Nodes run as VMs inside a subnet. Pods (in Azure CNI mode) consume IPs from the same subnet. The VNet also hosts internal load balancers used by Kubernetes services.
A managed identity is an Azure AD identity whose credentials are managed by the platform. AKS uses it to create load balancers and pull images. Unlike a service principal, a managed identity never needs manual credential rotation.
It sets the hostname prefix for the Kubernetes API server FQDN. It must be unique within the Azure region and can only contain alphanumeric characters and hyphens.
It enables the AKS OIDC Issuer endpoint, which is required for Workload Identity federation. This allows pods to authenticate to Azure services using a service account token instead of a stored credential.
Intermediate
With kubenet, nodes get subnet IPs and pods get a private overlay range — simpler but limited for network policies. With Azure CNI, every pod gets a real subnet IP, enabling direct routing and Azure Network Policies, but requiring a larger subnet.
Azure CNI allocates one IP per pod. On a 30-node cluster with 30 pods per node, you need 900+ IPs. A /24 subnet would be exhausted; /22 or larger is recommended for production clusters.
The cluster creates successfully but any Kubernetes Service of type LoadBalancer stays in Pending because AKS cannot create the Azure load balancer without permission to write to the network resource group.
Declare a Microsoft.Authorization/roleAssignments resource that references the identity principalId and the target scope. This ensures the role is always present after a fresh deployment without manual portal steps.
Scenario-based
Use an environment param with @allowed, then drive nodeVmSize and nodeCount through conditional variables or a parameter file. One main.bicep file plus dev.parameters.json and prod.parameters.json handles both environments cleanly.
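The conditional-variable approach could be sketched as a lookup table keyed by environment. The variable names and every value in the table are placeholders for illustration, not sizing recommendations.

```bicep
@allowed(['dev', 'stage', 'prod'])
param environment string = 'dev'

// Hypothetical sizing table: one row per allowed environment value.
var sizingByEnvironment = {
  dev: {
    nodeCount: 2
    nodeVmSize: 'Standard_D2s_v3'
    enableAutoScaling: false
  }
  stage: {
    nodeCount: 2
    nodeVmSize: 'Standard_D4s_v3'
    enableAutoScaling: false
  }
  prod: {
    nodeCount: 3
    nodeVmSize: 'Standard_D4s_v3'
    enableAutoScaling: true
  }
}

// Select the row for the environment being deployed.
var sizing = sizingByEnvironment[environment]

// Outputs shown only to make the selection visible; in the real template these
// values would feed count, vmSize, and enableAutoScaling on the node pool.
output nodeCount int = sizing.nodeCount
output nodeVmSize string = sizing.nodeVmSize
output enableAutoScaling bool = sizing.enableAutoScaling
```

A lookup table keeps all per-environment differences in one reviewable place, whereas parameter files keep them outside the template; which fits better depends on who is allowed to change the values.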
Run what-if first to see the delta. Decide whether to accept the manual change by updating the Bicep, or override it by redeploying the Bicep as-is. ARM's incremental mode does not delete unmanaged resources, but it will reconcile the properties Bicep declares.
I show the Bicep template where identity.type is set to UserAssigned and oidcIssuerProfile.enabled is set to true, alongside the CI/CD pipeline that enforces all cluster deployments go through this template. Azure Policy can enforce this further.
Real-world Usage
Enterprise AKS platform teams provision one Bicep template that creates the cluster alongside networking and identity. The template outputs the cluster name, kubelet identity object ID, and OIDC issuer URL. The CI/CD pipeline captures those outputs and uses them to configure Workload Identity bindings, attach container registries, and install Helm charts — all without hard-coded resource names.
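The template shown earlier outputs the cluster name and kubelet identity object ID but not the OIDC issuer URL, and it does not attach a registry. A sketch of both additions follows, assuming a registry that already exists in the same resource group. The acrName parameter and the symbolic names acr and kubeletAcrPull are hypothetical; the GUID is the built-in AcrPull role definition ID.

```bicep
// Hypothetical parameter: name of an existing container registry in this resource group.
param acrName string

// Built-in AcrPull role definition ID (identical in every tenant).
var acrPullRoleId = '7f951dda-4ed3-4680-a7ca-43fe172d538d'

resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: acrName
}

resource kubeletAcrPull 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(acr.id, aks.id, acrPullRoleId)
  scope: acr
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', acrPullRoleId)
    // Image pulls are performed by the kubelet identity, not the cluster identity.
    principalId: aks.properties.identityProfile.kubeletidentity.objectId
    principalType: 'ServicePrincipal'
  }
}

// Issuer URL the pipeline needs when creating federated credentials for Workload Identity.
output oidcIssuerUrl string = aks.properties.oidcIssuerProfile.issuerURL
```

A pipeline could then read oidcIssuerUrl from the deployment outputs instead of querying the cluster after the fact.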
Summary
AKS provisioning with Bicep requires understanding three interdependent layers: networking (VNet and subnet sizing), identity (user-assigned managed identity and role assignments), and cluster configuration (node pools, network plugin, OIDC). Getting each layer right in code prevents the most common production failures and makes every cluster an auditable, repeatable artifact.