Networking in AKS
Master the networking fundamentals of Azure Kubernetes Service — from choosing the right CNI plugin to building private, enterprise-grade cluster topologies.
🧒 Simple Explanation (ELI5)
Think of an AKS cluster like a big office building. Every person (pod) needs an address so mail (network traffic) can reach them. Kubenet is like giving people internal extension numbers — cheap, but outsiders can't dial them directly. Azure CNI is like giving every person their own phone number from the company phone plan — powerful, but you need enough numbers in your plan. The Load Balancer is the front desk that routes visitors to the right floor, and the Ingress Controller is the receptionist who reads the visitor's badge and sends them to the correct office. Network Policies are the locked doors between departments — only people with the right badge can walk through.
🔧 Technical Explanation
Network Plugin Comparison
AKS supports four networking models. Choosing the right one depends on your IP address budget, performance requirements, and Azure feature compatibility.
| Feature | Kubenet | Azure CNI | Azure CNI Overlay | CNI Powered by Cilium |
|---|---|---|---|---|
| Pod IP source | NAT'd bridge (10.244.x.x) | VNet subnet | Private overlay CIDR | Private overlay CIDR |
| IPs consumed per node | 1 (node only) | 1 + max-pods | 1 (node only) | 1 (node only) |
| Max pods / node default | 110 | 30 (configurable to 250) | 250 | 250 |
| VNet-direct pod routing | No (UDR needed) | Yes | No | No |
| Network Policy engine | Calico only | Azure NPM or Calico | Calico or Cilium | Cilium (eBPF) |
| Windows node pools | No | Yes | Yes | No (Linux only) |
| Best for | Dev/test, small clusters | Enterprise, VNet integration | Large clusters, IP-constrained | Advanced observability, eBPF |
Azure CNI Overlay gives you the best of both worlds — VNet-routable node IPs and an independent pod CIDR so you never run out of IPs. It's the recommended default for new clusters in 2025+.
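As a sketch of how you might provision that recommended default, the following creates a cluster with Azure CNI Overlay. The resource group, cluster name, and CIDR values are placeholders taken from this article's examples, not requirements:

```shell
# Sketch: create an AKS cluster with Azure CNI Overlay
# (myResourceGroup / myAKSCluster and the CIDRs are placeholder values)
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --service-cidr 10.1.0.0/16 \
  --dns-service-ip 10.1.0.10
```

The key pair of flags is `--network-plugin azure` plus `--network-plugin-mode overlay`: nodes draw IPs from the VNet subnet while pods use the private `--pod-cidr`.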
IP Address Planning
The most common production mistake is running out of IP addresses. Use this sizing guide:
| Component | Azure CNI | Azure CNI Overlay |
|---|---|---|
| VNet CIDR | 10.0.0.0/16 (65,536 IPs) | 10.0.0.0/16 (65,536 IPs) |
| Node subnet | /21 (2,046 IPs) — covers nodes + pods | /24 (254 IPs) — nodes only |
| Pod CIDR | N/A (uses node subnet) | 192.168.0.0/16 (65,536 IPs) |
| Service CIDR | 10.1.0.0/16 | 10.1.0.0/16 |
| DNS Service IP | 10.1.0.10 | 10.1.0.10 |
Subnet sizing formula for Azure CNI: (max nodes × max-pods-per-node) + nodes + 5 Azure-reserved IPs. For 10 nodes with 30 pods each: 10×30 + 10 + 5 = 315 IPs, so you need at least a /23 subnet (512 addresses, 507 usable after Azure reserves 5 per subnet).
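The sizing arithmetic above can be checked in plain shell. The node and pod counts below are the example's values; substitute your own:

```shell
#!/usr/bin/env bash
# Required IPs for Azure CNI: (nodes * max_pods) + nodes + 5 Azure-reserved
nodes=10
max_pods=30
required=$(( nodes * max_pods + nodes + 5 ))
echo "Required IPs: $required"

# Find the smallest CIDR prefix whose address count covers the requirement
prefix=32
size=1
while (( size < required )); do
  size=$(( size * 2 ))
  prefix=$(( prefix - 1 ))
done
echo "Smallest subnet: /$prefix ($size addresses)"
```

For the 10-node example this prints 315 required IPs and a /23 subnet, matching the table above.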
Load Balancers
AKS provisions an Azure Standard Load Balancer by default for Services of type LoadBalancer.
| Type | Annotation | Use Case |
|---|---|---|
| External (default) | — | Public-facing APIs, web apps |
| Internal | service.beta.kubernetes.io/azure-load-balancer-internal: "true" | Backend services, private APIs |
| Static IP | service.beta.kubernetes.io/azure-load-balancer-ipv4 | DNS records, firewall allow-lists |
Ingress Controllers
| Feature | NGINX Ingress Controller | Application Gateway Ingress (AGIC) |
|---|---|---|
| Runs where | Inside the cluster (pods) | Azure-managed (outside cluster) |
| TLS termination | At NGINX pod | At App Gateway (WAF included) |
| WAF support | ModSecurity (self-managed) | Azure WAF v2 (managed rules) |
| Custom routing | Extensive (annotations) | Limited (Azure-defined) |
| Best for | Flexibility, multi-cloud portability | Azure-native, WAF-first deployments |
DNS Integration
CoreDNS runs inside every AKS cluster and handles in-cluster service discovery. For external DNS, ExternalDNS automatically creates Azure DNS records for Ingress and LoadBalancer services.
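A minimal sketch of the ExternalDNS workflow, assuming ExternalDNS is already deployed with credentials for your Azure DNS zone. The Service name and `myapp.example.com` hostname are illustrative placeholders:

```shell
# Annotate a LoadBalancer Service so ExternalDNS publishes a DNS record for it
# (the "web" Service and myapp.example.com are placeholder values)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    external-dns.alpha.kubernetes.io/hostname: myapp.example.com
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app: web
EOF
# ExternalDNS watches the Service and creates an A record for
# myapp.example.com pointing at the load balancer's IP.
```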
Network Policies
By default AKS allows all pod-to-pod traffic. Network Policies act as firewalls between pods.
| Engine | Supported CNI | Highlights |
|---|---|---|
| Azure NPM | Azure CNI | Azure-managed, IPTables-based |
| Calico | Kubenet, Azure CNI, Overlay | Feature-rich, community standard |
| Cilium | CNI Powered by Cilium | eBPF-powered, L7 policies, best performance |
Private Clusters
A private AKS cluster restricts the API server to a private IP. The control plane is accessible only from within the VNet or peered networks.
You need a jump box, Azure Bastion, VPN Gateway, or ExpressRoute to reach a private cluster's API server. CI/CD pipelines need a self-hosted agent inside the VNet.
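When you have no network path at all, `az aks command invoke` runs kubectl for you through the Azure control plane. Resource names here are placeholders:

```shell
# Run a kubectl command against a private cluster without VNet access
# (myResourceGroup / myAKSCluster are placeholder names)
az aks command invoke \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --command "kubectl get nodes -o wide"
```

This is convenient for break-glass access and simple checks, but a jump box or VPN is still the norm for day-to-day operations.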
Hub-Spoke Topology
Enterprise deployments use a hub-spoke VNet design. The hub VNet contains shared services (firewall, DNS, VPN). Each spoke VNet hosts an AKS cluster. VNets are connected via VNet peering, and all egress traffic flows through an Azure Firewall in the hub.
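A sketch of the hub-spoke wiring in CLI form. All names and the firewall IP are assumptions for illustration (the 10.0.0.4 firewall address matches the reference architecture later in this article):

```shell
# Peer hub and spoke VNets in both directions (names are placeholders)
az network vnet peering create --name hub-to-spoke \
  --resource-group myResourceGroup --vnet-name hub-vnet \
  --remote-vnet spoke-vnet --allow-vnet-access --allow-forwarded-traffic
az network vnet peering create --name spoke-to-hub \
  --resource-group myResourceGroup --vnet-name spoke-vnet \
  --remote-vnet hub-vnet --allow-vnet-access --allow-forwarded-traffic

# Force all egress from the AKS subnet through the hub firewall via a UDR
az network route-table create -g myResourceGroup -n aks-egress
az network route-table route create -g myResourceGroup \
  --route-table-name aks-egress --name default-via-fw \
  --address-prefix 0.0.0.0/0 --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.0.4
```

The route table is then associated with the AKS node subnet, and the cluster is created with outbound type userDefinedRouting so Azure does not also provision a public load balancer for egress.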
⌨️ Hands-on
Check Current Network Plugin
```bash
# Get the network profile of your AKS cluster
az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query "networkProfile" \
  --output table

# Expected output:
# NetworkPlugin    NetworkPolicy    PodCidr           ServiceCidr    DnsServiceIp
# ---------------  ---------------  ---------------   -------------  ------------
# azure            calico           192.168.0.0/16    10.1.0.0/16    10.1.0.10
```
Create an Internal Load Balancer
```yaml
# internal-lb.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-internal
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "svc-subnet"
spec:
  type: LoadBalancer
  ports:
    - port: 443
      targetPort: 8443
      protocol: TCP
  selector:
    app: api-backend
```

```bash
kubectl apply -f internal-lb.yaml

# Wait for the internal IP assignment
kubectl get svc api-internal -w
# NAME           TYPE           CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
# api-internal   LoadBalancer   10.1.0.145   10.0.2.50     443:31998/TCP   45s
```
Deploy NGINX Ingress Controller
```bash
# Add the ingress-nginx Helm repo
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Install NGINX Ingress with an Azure public IP
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-system \
  --create-namespace \
  --set controller.replicaCount=2 \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"="/healthz"

# Verify the external IP
kubectl get svc -n ingress-system ingress-nginx-controller
```
Test DNS Resolution Inside a Pod
```bash
# Spin up a debug pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- sh

# Inside the pod, test CoreDNS resolution
nslookup kubernetes.default.svc.cluster.local
# Server:   10.1.0.10
# Address:  10.1.0.10:53
# Name:     kubernetes.default.svc.cluster.local
# Address:  10.1.0.1

# Test external DNS resolution
nslookup microsoft.com
exit
```
Apply a Deny-All Network Policy
```yaml
# deny-all.yaml — block all ingress to the production namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
```

```bash
kubectl apply -f deny-all.yaml

# Verify — traffic from another namespace should now be blocked
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never \
  -n staging -- curl -s --max-time 3 http://api-svc.production.svc.cluster.local
# curl: (28) Connection timed out
```
🐛 Debugging Scenarios
Scenario 1: Service type LoadBalancer Stuck in "Pending"
Symptom: kubectl get svc shows EXTERNAL-IP as <pending> for more than 5 minutes.
```bash
# Step 1: Check service events
kubectl describe svc my-loadbalancer-service
# Look for: "Error syncing load balancer" or "EnsureLoadBalancer failed"

# Step 2: Check if the cluster has outbound connectivity
az aks show -g myResourceGroup -n myAKSCluster \
  --query "networkProfile.outboundType"
# If "userDefinedRouting" — ensure your Azure Firewall/UDR has the right rules

# Step 3: Check Azure public IP quota
az network public-ip list -g MC_myResourceGroup_myAKSCluster_eastus -o table
az quota show --resource-name PublicIPAddresses \
  --scope "/subscriptions/$(az account show --query id -o tsv)/providers/Microsoft.Network/locations/eastus"

# Step 4: For internal LB, verify the subnet annotation matches an existing subnet
az network vnet subnet list \
  -g myResourceGroup --vnet-name myVNet -o table

# Step 5: Check cloud-controller-manager logs
kubectl logs -n kube-system -l component=cloud-controller-manager --tail=50
```
Scenario 2: Pods Can't Resolve DNS
Symptom: Application pods log Name or service not known or NXDOMAIN errors.
```bash
# Step 1: Verify CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Step 2: Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# Step 3: Test from inside a pod
kubectl run dns-debug --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default

# Step 4: If external DNS fails but internal works, check VNet DNS servers
az network vnet show -g myResourceGroup -n myVNet \
  --query "dhcpOptions.dnsServers"
# If custom DNS servers are set, they must forward to 168.63.129.16 (Azure DNS)

# Step 5: Check if CoreDNS ConfigMap has custom overrides
kubectl get configmap coredns-custom -n kube-system -o yaml

# Step 6: Restart CoreDNS to pick up changes
kubectl rollout restart deployment coredns -n kube-system
```
Scenario 3: Traffic Denied Between Namespaces
Symptom: Pods in the frontend namespace cannot talk to pods in the backend namespace, but everything worked before.
```bash
# Step 1: List network policies across all namespaces
kubectl get networkpolicy -A

# Step 2: Inspect the backend namespace policies
kubectl describe networkpolicy -n backend

# Step 3: Look for a deny-all policy that blocks all ingress.
# If you see a policy with podSelector: {} and policyTypes: [Ingress] but no
# ingress rules, that's a deny-all — you need to add an allow rule.

# Step 4: Create an allow rule for frontend → backend traffic
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF

# Step 5: Test connectivity again
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never \
  -n frontend -- curl -s http://api.backend.svc.cluster.local:8080/healthz
```

Network Policies are additive. If any policy selects a pod and specifies policyTypes: [Ingress] without any ingress rules, all ingress to that pod is denied. You need a separate policy that explicitly allows the desired traffic.
🎯 Interview Questions
Beginner
Q: What is the difference between Kubenet and Azure CNI?
Kubenet assigns pods IPs from a private bridge network (10.244.x.x) and uses NAT for external communication, consuming only 1 VNet IP per node. Azure CNI assigns each pod a real VNet IP, enabling direct pod-to-VNet-resource communication but consuming significantly more IPs. Kubenet is simpler and cheaper; Azure CNI is needed for VNet integration, Windows nodes, and features like Virtual Nodes.
Q: What is an Ingress, and why does it need a controller?
An Ingress is a Kubernetes API object that manages external HTTP/HTTPS access to services inside the cluster. It defines rules for routing traffic based on hostnames and paths. An Ingress resource needs an Ingress Controller (like NGINX or AGIC) to function — the resource alone does nothing without a controller watching it.
Q: What happens when you create a Service of type LoadBalancer in AKS?
When you create a Service of type LoadBalancer in AKS, the Azure cloud controller manager provisions an Azure Standard Load Balancer with a public (or internal) IP. Traffic arriving at that IP on the specified port is forwarded to healthy pods matching the service's selector across the cluster nodes via node ports.
Q: What role does CoreDNS play in AKS?
CoreDNS is the cluster DNS server in AKS. It resolves in-cluster service names like my-svc.my-namespace.svc.cluster.local to ClusterIP addresses. It also forwards external DNS queries (e.g., microsoft.com) to upstream DNS servers. Every pod's /etc/resolv.conf points to the CoreDNS service IP (typically 10.1.0.10).
Q: What is a Network Policy?
A Network Policy is a Kubernetes resource that controls traffic flow at the IP/port level for pods. By default all traffic is allowed. Once a Network Policy selects a pod, only traffic explicitly allowed by the policy's rules is permitted. AKS supports Network Policies through Azure NPM, Calico, or Cilium engines.
Intermediate
Q: How do you size the node subnet for an Azure CNI cluster?
The formula is: (max nodes × max-pods-per-node) + max nodes + 5 Azure reserved IPs. For 20 nodes with 30 pods each: 20×30 + 20 + 5 = 625 IPs, requiring at least a /22 subnet (1,024 addresses). For Azure CNI Overlay, only node IPs come from the VNet subnet — pods use a separate private CIDR, drastically reducing VNet IP consumption.
Q: What is a private AKS cluster, and how do you access it?
A private cluster exposes the API server only on a private IP within the VNet instead of a public FQDN. Access requires being on the same network or a peered/connected network (VPN, ExpressRoute, Bastion). You manage it via a jump box VM inside the VNet or by using az aks command invoke to run kubectl commands via ARM without direct network access. CI/CD requires self-hosted agents within the VNet.
Q: How do Azure NPM and Calico compare as Network Policy engines?
Azure NPM is Azure-managed, uses IPTables, and supports standard Kubernetes NetworkPolicy resources only. Calico supports standard NetworkPolicy plus its own GlobalNetworkPolicy CRD with richer rules (deny actions, application-layer policies, DNS-based rules). Calico is generally recommended for production workloads due to its maturity and more advanced features. NPM is being deprecated in favor of Azure Network Policy powered by Cilium.
Q: How does Azure CNI Overlay work, and what is its trade-off?
Azure CNI Overlay assigns node IPs from the VNet subnet but pod IPs from a private, non-routable CIDR (e.g., 192.168.0.0/16). Pods can still communicate with VNet resources — traffic is NATed via the node IP. This decouples pod density from VNet IP planning, supports 250 pods per node, and drastically reduces subnet size requirements. The trade-off is that pods are not directly routable from VNet resources by their pod IP.
Q: What does ExternalDNS do?
ExternalDNS watches Kubernetes Ingress and Service resources and automatically creates DNS records in Azure DNS zones for them. When you annotate a Service or Ingress with external-dns.alpha.kubernetes.io/hostname: myapp.example.com, ExternalDNS creates an A record in your Azure DNS zone pointing to the LoadBalancer's IP. This eliminates manual DNS management.
Scenario-Based
Q: A LoadBalancer Service is stuck in Pending. How do you troubleshoot it?
1) kubectl describe svc — check events for provisioning errors. 2) Verify the cluster's outbound type — if userDefinedRouting, ensure the firewall/UDR allows ARM traffic on port 443. 3) Check Azure Public IP quota in the region. 4) For internal LB, verify the subnet annotation references a real subnet with available IPs. 5) Check cloud-controller-manager logs in kube-system. 6) If the cluster uses a managed identity, verify it has Network Contributor on the VNet/subnet.
Q: Users report intermittent high latency between two in-cluster services. How do you investigate?
1) Check if pods are on the same node or spread across nodes — cross-node traffic adds latency. 2) Verify no Network Policies are causing dropped packets and retransmissions. 3) Check node-level CPU/memory pressure that might delay packet processing. 4) Investigate kube-proxy mode (iptables vs IPVS) — IPVS handles large service counts better. 5) Check if the service uses externalTrafficPolicy: Local (avoids extra hops). 6) Use kubectl exec to run curl -w with timing to isolate DNS vs connection vs transfer latency. 7) Check if CoreDNS is a bottleneck (high QPS, pods restarting).
Q: Prometheus in an observability namespace cannot scrape targets in namespaces with deny-all policies. What do you do?
Create a NetworkPolicy in each target namespace that allows ingress on the metrics port from the observability namespace. Example: allow ingress from namespaceSelector: {matchLabels: {kubernetes.io/metadata.name: observability}} on the port Prometheus scrapes (usually 9090 or the application metrics port). Label the observability namespace if needed. Test with kubectl exec from a Prometheus pod to curl a target's metrics endpoint.
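The allow rule described above could look like this. The namespace names and port 9090 are assumptions; substitute your actual metrics port:

```shell
# Sketch: allow Prometheus in "observability" to scrape pods in "production"
# (namespace names and port 9090 are assumed values)
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: observability
      ports:
        - protocol: TCP
          port: 9090
EOF
```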
Q: Design a private AKS deployment that is reachable from on-premises.
1) Create a private AKS cluster with --enable-private-cluster and a custom private DNS zone. 2) Deploy in a spoke VNet peered with a hub VNet. 3) The hub VNet connects to on-prem via ExpressRoute or Site-to-Site VPN. 4) Configure VNet peering between hub and spoke with "Allow forwarded traffic". 5) Link the private DNS zone to the hub VNet so the API server FQDN resolves on-prem. 6) Route cluster egress through Azure Firewall in the hub using UDR + outbound type userDefinedRouting. 7) Deploy self-hosted CI/CD agents in the spoke VNet.
Q: DNS lookups in the cluster are slow. How do you diagnose and fix the problem?
1) Check CoreDNS pod health: kubectl get pods -n kube-system -l k8s-app=kube-dns — ensure all replicas are Running. 2) Check if CoreDNS pods are OOMKilled — increase memory limits. 3) Look at CoreDNS logs for upstream timeout errors — VNet custom DNS servers may be slow or unreachable. 4) Check if ndots:5 in resolv.conf causes excessive queries — add dnsConfig to pods with ndots: 2. 5) Verify the VNet DNS setting forwards to 168.63.129.16. 6) Scale CoreDNS with kubectl scale deployment coredns -n kube-system --replicas=4 if QPS is too high.
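The ndots tuning mentioned above can be sketched as a pod-level dnsConfig override. The pod name and image are placeholders:

```shell
# Sketch: lower ndots so single-label external lookups skip search-domain
# expansion (pod name and image are placeholder values)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: low-ndots-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
  dnsConfig:
    options:
      - name: ndots
        value: "2"
EOF
```

With ndots: 2, a query like microsoft.com (one dot) goes straight to the upstream resolver instead of first being tried against every cluster search domain.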
🌍 Real-World Use Case
Enterprise Hub-Spoke Topology with Private AKS
A financial services company runs regulated workloads that cannot have public endpoints. Their architecture:
- Hub VNet (10.0.0.0/16): Azure Firewall (10.0.0.4), VPN Gateway for on-prem connectivity, Azure Bastion for jump-box access, central DNS forwarder VMs.
- Spoke VNet (10.1.0.0/16): Private AKS cluster using Azure CNI Overlay, node subnet /24, pod CIDR 192.168.0.0/16, ACR with private endpoint.
- Networking: VNet peering between hub and spoke, UDR on the AKS subnet sending 0.0.0.0/0 to the firewall, outbound type set to userDefinedRouting.
- Ingress: AGIC with internal Application Gateway, exposed to on-prem via the VPN, WAF policy blocking OWASP top-10 attacks.
- DNS: Private DNS zone linked to both VNets, ExternalDNS creating records in a private Azure DNS zone for internal service discovery.
- Network Policies: Calico with deny-all default in every namespace, explicit allow rules for each service-to-service path, GlobalNetworkPolicy for cluster-wide rules (allow kube-system, allow monitoring).
This architecture passes SOC 2 and PCI-DSS audits with zero public network exposure while maintaining full developer productivity through Azure Bastion and az aks command invoke.
📝 Summary
- Kubenet is lightweight for dev/test; Azure CNI gives direct VNet integration; Azure CNI Overlay is the best default for production (saves IPs); CNI Cilium adds eBPF observability.
- IP planning: Use the formula (nodes × max-pods) + nodes + 5 for Azure CNI; Overlay decouples pod IPs from the VNet.
- Azure Standard Load Balancer handles L4 traffic; add NGINX Ingress or AGIC for L7 routing and TLS termination.
- CoreDNS handles internal resolution; ExternalDNS automates Azure DNS record creation.
- Network Policies are essential — start with deny-all and add allow rules per service pair.
- Private clusters remove all public endpoints; access requires VPN, Bastion, or az aks command invoke.
- Hub-spoke topology with Azure Firewall is the standard enterprise pattern for regulated workloads.