Hands-on Lesson 14 of 14

AKS Interview Preparation

40+ real interview questions covering AKS fundamentals, architecture, networking, security, scaling, CI/CD, and troubleshooting — with production-grade answers that demonstrate deep understanding.

💡
How to Use This

Don't just memorize answers. For each question, try to answer it out loud before reading the answer. In real interviews, you'll need to explain concepts clearly and confidently. Focus on the "why" behind each answer, not just the "what".

🔰 Section 1 — AKS Fundamentals (Questions 1–10)

Q1: What is AKS and what does "managed" mean?

Answer: AKS is Azure Kubernetes Service — Azure's managed Kubernetes offering. "Managed" means Azure owns the control plane: the API server, etcd, scheduler, and controller manager. They're highly available, auto-patched, and you never see the VMs running them. You only manage the worker nodes (node pools) and your workloads. The control plane is free — you only pay for worker node VMs.

Q2: What's the difference between the control plane and the data plane in AKS?

Answer:

| Component | Control Plane (Azure-managed) | Data Plane (You manage) |
|---|---|---|
| Contains | API server, etcd, scheduler, controller manager | Worker nodes, your pods, volumes, networking |
| Cost | Free (Standard/Premium tiers available for SLA) | You pay for VMs, disks, networking |
| Upgrades | Azure patches; you trigger K8s version upgrades | OS image upgrades (auto or manual) |
| Access | Via kubectl / API — you never SSH into the control plane | SSH possible (but discouraged), exec into pods |

Q3: What is the MC_ resource group?

Answer: When you create an AKS cluster in a resource group like myRG, Azure creates a second resource group called MC_myRG_myCluster_eastus. This contains all the infrastructure resources — VMSS (node pools), load balancers, public IPs, NSGs, route tables, and the VNet (if Azure-managed). You should never manually modify resources in the MC_ group — AKS reconciles them automatically and your changes would be overwritten.
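Rather than guessing the generated name, you can look it up; a quick check, assuming a cluster named myCluster in myRG:

```shell
# Print the auto-generated node resource group of an existing cluster
az aks show --resource-group myRG --name myCluster \
  --query nodeResourceGroup --output tsv
```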

Q4: How does AKS differ from EKS and GKE?

Answer: All three are managed Kubernetes, so core concepts transfer directly. Key differences: the AKS control plane is free on the Free tier (you pay for Standard/Premium only if you need an SLA), while EKS and GKE charge roughly $0.10/hour per cluster for the control plane. AKS integrates natively with Azure AD (Entra ID), ACR, and Azure Monitor; EKS leans on AWS IAM and has the deepest AWS service integration; GKE typically ships new Kubernetes versions first and offers the most automation (Autopilot). Node management is VMSS-based on AKS, node groups on EKS, and node pools on GKE.

Q5: What identity types does AKS use and why?

Answer: AKS uses Managed Identities (system-assigned or user-assigned) instead of service principals. The cluster identity (control plane) needs permissions to manage Azure resources (e.g., create load balancers, read from ACR). The kubelet identity is used by nodes to pull images from ACR. Using managed identities avoids credential rotation headaches — Azure handles the token lifecycle automatically.
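As a sketch (cluster and registry names are placeholders), the identity setup typically looks like:

```shell
# Create a cluster with a system-assigned managed identity
az aks create --resource-group myRG --name myCluster --enable-managed-identity

# Grant the kubelet identity AcrPull on a registry so nodes can pull images
az aks update --resource-group myRG --name myCluster --attach-acr myRegistry
```

The `--attach-acr` flag assigns the AcrPull role to the kubelet identity, so no image pull secrets are needed.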

Q6: Explain AKS pricing tiers (Free, Standard, Premium).

Answer:

| Tier | SLA | Features | Cost |
|---|---|---|---|
| Free | No SLA | Basic features, good for dev/test | $0 |
| Standard | 99.95% (with AZ) / 99.9% | SLA, more API server capacity | ~$73/month |
| Premium | 99.95%+ | Long-term support, advanced networking, AKS Automatic | ~$146/month |

In interviews, emphasize: "Free tier for dev/test, Standard for production where uptime matters, Premium for enterprise compliance and LTS."

Q7: What happens during an AKS upgrade?

Answer: AKS upgrades are rolling. First the control plane is upgraded (API server, etcd — this is seamless). Then nodes are upgraded one at a time: AKS cordons a node, drains pods (respecting PodDisruptionBudgets), upgrades the node OS and kubelet, then uncordons it. The max-surge setting controls how many extra nodes are created during upgrade to maintain capacity. Best practice: set max-surge=1 or 33% and always have PDBs on critical workloads.
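A hedged example of the two commands involved (names and the target version are placeholders; check `az aks get-upgrades` for valid targets):

```shell
# Allow one-third extra surge capacity per upgrade batch on this pool
az aks nodepool update --resource-group myRG --cluster-name myCluster \
  --name nodepool1 --max-surge 33%

# Upgrade the control plane and node pools to a target version
az aks upgrade --resource-group myRG --name myCluster --kubernetes-version 1.29.2
```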

Q8: What is the difference between system and user node pools?

Answer:

| Aspect | System node pool | User node pool |
|---|---|---|
| Purpose | Runs critical system pods (CoreDNS, metrics-server, konnectivity) | Runs your application workloads |
| Minimum nodes | At least 1 (every cluster needs one system pool) | Can scale to 0 |
| OS | Linux only | Linux or Windows |
| Taints | Commonly tainted with CriticalAddonsOnly=true:NoSchedule to keep app pods off | Any custom taints/labels you need |

Best practice: Separate system and user pools. System pool uses reliable VMs (Standard_D2s_v5). User pool can use cost-effective or specialized VMs (GPU, spot).

Q9: Can you scale an AKS node pool to zero?

Answer: User node pools can be scaled to 0 — useful for cost optimization (e.g., GPU pools that only run during ML training). System node pools cannot be scaled to 0 because they must run system pods. Minimum 1 node in the system pool at all times.
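For illustration (pool names are placeholders):

```shell
# Manually scale a user pool (e.g., a GPU pool) down to zero
az aks nodepool scale --resource-group myRG --cluster-name myCluster \
  --name gpupool --node-count 0

# Or let the cluster autoscaler manage it, with 0 as the floor
az aks nodepool update --resource-group myRG --cluster-name myCluster \
  --name gpupool --enable-cluster-autoscaler --min-count 0 --max-count 4
```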

Q10: How do you connect kubectl to your AKS cluster?

Answer: az aks get-credentials --resource-group myRG --name myCluster. This merges the cluster's kubeconfig into your ~/.kube/config. With Azure AD integration, the first kubectl command triggers a browser login for authentication. For CI/CD, use the --admin flag (cluster-admin credentials, available only while local accounts are enabled) or the kubelogin plugin with a service principal for non-interactive Azure AD auth.

🌐 Section 2 — Networking (Questions 11–18)

Q11: Explain kubenet vs Azure CNI.

Answer:

| Feature | Kubenet | Azure CNI |
|---|---|---|
| Pod IPs | Private range, NAT'd to node IP | Real VNet IPs for every pod |
| IP consumption | Low (1 IP per node) | High (1 IP per pod) |
| VNet integration | Limited — pods aren't directly routable | Full — pods get VNet IPs |
| Network Policies | Calico only | Calico or Azure NPM |
| Best for | Small clusters, IP-constrained environments | Production, VNet-peered architectures |

Interview tip: "In most production scenarios, we use Azure CNI because pods need to be directly reachable from other VNet resources like databases, VMs, and private endpoints."

Q12: How does the AKS load balancer work?

Answer: When you create a Kubernetes Service of type LoadBalancer, AKS automatically provisions an Azure Load Balancer in the MC_ resource group. It creates a frontend IP (public or internal), a backend pool (node VMSS instances), and health probes. Traffic flows: Client → Azure LB → Node (NodePort) → kube-proxy/iptables → Pod. For internal services, use service.beta.kubernetes.io/azure-load-balancer-internal: "true" annotation.
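A minimal internal Service manifest illustrating that annotation (names and ports are placeholders):

```yaml
# Internal LoadBalancer Service: the frontend IP comes from the VNet, not the internet
apiVersion: v1
kind: Service
metadata:
  name: myapp-internal
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
```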

Q13: What is an Ingress Controller and why do you need one on AKS?

Answer: An Ingress Controller is a reverse proxy that routes HTTP/HTTPS traffic to backend services based on hostname and path rules. Without it, each service would need its own LoadBalancer (= its own public IP = more cost). With an Ingress Controller (NGINX, Traefik, or Azure Application Gateway), you have one Load Balancer routing to many services. It also handles TLS termination, path-based routing, rate limiting, and authentication.
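A sketch of an NGINX Ingress routing two paths of one host to different backends (the hostname, TLS secret, and service names are made up):

```yaml
# One Ingress resource, one shared load balancer, many backend services
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx
  tls:
    - hosts: [shop.example.com]
      secretName: shop-tls
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-svc
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend-svc
                port:
                  number: 80
```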

Q14: How do network policies work in AKS?

Answer: Network policies are Kubernetes resources that control pod-to-pod traffic at L3/L4. By default, all pods can communicate with all other pods. Network policies act as firewall rules — you define which pods can talk to which pods on which ports. AKS supports two engines: Calico (works with both kubenet and Azure CNI) and Azure NPM (Azure CNI only). Once you create any NetworkPolicy in a namespace, all traffic not explicitly allowed is denied (default-deny behavior).
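For example, a policy that only lets frontend pods reach backend pods on port 8080 (labels and namespace are illustrative):

```yaml
# Once this exists, all other ingress to app=backend pods is denied
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
```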

Q15: What is Azure Private Link / Private Cluster?

Answer: A private AKS cluster disables the public endpoint of the API server. The API server gets a private IP from your VNet instead. Access is only possible from within the VNet or through VPN/ExpressRoute. This is required for strict compliance environments where the control plane must not be internet-accessible. You enable it with --enable-private-cluster during creation.

Q16: Explain DNS resolution inside AKS.

Answer: CoreDNS runs as pods in kube-system and provides DNS for the cluster. All pods have /etc/resolv.conf pointing to the CoreDNS ClusterIP (typically 10.0.0.10). A service myservice in namespace myns resolves as myservice.myns.svc.cluster.local. CoreDNS can be customized with ConfigMaps for forwarding external domains to custom DNS servers.
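On AKS you don't edit the main CoreDNS ConfigMap directly; you drop overrides into a `coredns-custom` ConfigMap that AKS merges in. A sketch forwarding an internal zone to a custom DNS server (zone name and IP are placeholders):

```yaml
# Keys ending in .server are merged into the CoreDNS config as extra server blocks
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  corp.server: |
    corp.internal:53 {
      forward . 10.1.0.4
    }
```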

Q17: How do you expose a service externally on AKS?

Answer: Three ways, from simplest to most production-ready:

  1. Service type LoadBalancer: Azure creates a public LB with a public IP. Quick, but each service gets its own IP.
  2. Ingress Controller (NGINX/Traefik): One LB + Ingress rules for path/host routing. Most common.
  3. Azure Application Gateway Ingress Controller (AGIC): Uses Azure's L7 load balancer natively. Supports WAF, autoscaling, and SSL offloading.

Q18: What is the difference between ClusterIP, NodePort, and LoadBalancer service types?

Answer:

| Type | Accessible From | Use Case |
|---|---|---|
| ClusterIP | Inside the cluster only | Internal microservice communication |
| NodePort | Node IP + high port (30000-32767) | Testing, rarely used in prod on AKS |
| LoadBalancer | External via Azure LB | Production external traffic |

📈 Section 3 — Scaling & Performance (Questions 19–24)

Q19: What is the Cluster Autoscaler and how does it work?

Answer: The Cluster Autoscaler watches for pods that can't be scheduled due to insufficient resources. When it finds pending pods, it tells Azure to add nodes to the VMSS (scale-out). When nodes are underutilized for 10+ minutes, it removes them (scale-in). It respects PodDisruptionBudgets during scale-in. Configure with --enable-cluster-autoscaler --min-count 2 --max-count 10.

Q20: How do HPA and Cluster Autoscaler work together?

Answer: They're complementary, not competing:

  1. Load increases → HPA creates more pod replicas
  2. New pods can't be scheduled (nodes full) → they go Pending
  3. Cluster Autoscaler sees Pending pods → provisions new nodes
  4. New nodes start → Pending pods get scheduled

The key: HPA scales pods (application layer), Cluster Autoscaler scales nodes (infrastructure layer).
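The application-layer half of that loop can be sketched as a standard HPA (the deployment name and thresholds are illustrative):

```yaml
# Keep average CPU around 70% across 3-20 replicas of the "api" Deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```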

Q21: What is KEDA and when would you use it over HPA?

Answer: KEDA (Kubernetes Event-Driven Autoscaling) scales based on external event sources — Azure Service Bus queue depth, Kafka lag, Prometheus metrics, cron schedules. HPA only scales on CPU/memory (or custom metrics with extra setup). KEDA is best for event-driven workloads: "Scale my worker pods to match the number of messages in the Azure Service Bus queue, and scale to 0 when idle."
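A sketch of a KEDA ScaledObject for exactly that Service Bus scenario (the queue name, target deployment, and the `servicebus-auth` TriggerAuthentication are assumptions):

```yaml
# Scale the "worker" Deployment with queue depth; scale to 0 when the queue is empty
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: worker
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "5"   # target messages per replica
      authenticationRef:
        name: servicebus-auth
```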

Q22: What are Spot node pools?

Answer: Spot node pools use Azure Spot VMs — unused capacity at up to 90% discount. However, Azure can evict them at any time (30-second warning). Use them for: batch processing, CI/CD build agents, dev/test, stateless workers. Never for: stateful workloads, databases, or pods that can't tolerate interruption. Configure with --priority Spot --eviction-policy Delete --spot-max-price -1.
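For example (the pool name is a placeholder):

```shell
# Add a Spot pool; a max price of -1 means "pay up to the current on-demand price"
az aks nodepool add --resource-group myRG --cluster-name myCluster \
  --name spotpool --priority Spot --eviction-policy Delete \
  --spot-max-price -1 --node-count 3
```

Spot pools are automatically tainted with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so only pods carrying a matching toleration land on them.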

Q23: How do you right-size resource requests and limits?

Answer: Measure first, then set. Use kubectl top pods, Container Insights, or the Vertical Pod Autoscaler in recommendation mode to observe real usage over a representative period, then set requests near the observed steady state (e.g., P95) and memory limits close to requests so a leaking pod fails fast instead of destabilizing the node. Remember: requests drive scheduling, bin-packing, and autoscaler decisions. Over-requesting wastes nodes; under-requesting causes CPU throttling and OOMKills. Revisit the numbers periodically as traffic patterns change.

Q24: What are Virtual Nodes / ACI integration?

Answer: Virtual Nodes let AKS burst to Azure Container Instances — serverless containers with no node management. A pod scheduled on a virtual node runs in ACI within seconds (no waiting for VMs to provision). Great for: burst scaling, handling traffic spikes, running short-lived jobs. Limitation: no persistent volumes, limited networking, Linux only.

🔒 Section 4 — Security & RBAC (Questions 25–32)

Q25: How does Azure AD integration with AKS work?

Answer: AKS integrates with Azure AD (Entra ID) for authentication. When a user runs kubectl, they authenticate via Azure AD (browser flow or service principal). The API server validates the Azure AD token and maps the user to Kubernetes RBAC. You can grant Azure AD groups specific ClusterRoles — e.g., the "DevOps" AD group gets the edit ClusterRole in the production namespace.

Q26: Explain Kubernetes RBAC vs Azure RBAC for AKS.

Answer:

| Aspect | Kubernetes RBAC | Azure RBAC |
|---|---|---|
| Scope | Inside the cluster (pods, services, secrets) | Azure resource level (AKS resource, resource group) |
| Managed by | kubectl / K8s manifests | Azure Portal / az CLI |
| Examples | "User X can read pods in namespace Y" | "User X can stop/start the AKS cluster" |
| Best for | Workload-level access control | Infrastructure-level access control |

Production pattern: Use Azure RBAC for who can manage the AKS resource. Use Kubernetes RBAC for what they can do inside the cluster.

Q27: What is Workload Identity?

Answer: Workload Identity lets a Kubernetes pod authenticate to Azure services (Key Vault, Storage, SQL) using a federated Azure AD identity — no secrets stored in the cluster. It replaces the older pod-managed identity (aad-pod-identity). The flow: Kubernetes ServiceAccount → OIDC federation → Azure Managed Identity → Azure resource access. It's the recommended way to access Azure resources from AKS pods.
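The Kubernetes side of that flow is mostly two markers: an annotated ServiceAccount and a label on the pod (the client ID, image, and names below are placeholders):

```yaml
# ServiceAccount federated to a user-assigned managed identity
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: prod
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
---
# Pods opt in with the label and reference the ServiceAccount
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  namespace: prod
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: myapp-sa
  containers:
    - name: app
      image: myregistry.azurecr.io/myapp:1.0
```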

Q28: How do you manage secrets in AKS?

Answer: Three approaches (from basic to production):

  1. Kubernetes Secrets: base64-encoded (NOT encrypted by default). Fine for non-sensitive config. Enable encryption at rest with Azure Key Vault KMS.
  2. Azure Key Vault + CSI driver: Secrets stored in Key Vault, mounted as files/env vars in pods via the Secrets Store CSI Driver. Centralized management, audit logging, rotation support.
  3. External Secrets Operator: Syncs secrets from Key Vault into Kubernetes Secrets automatically. Best for teams that want K8s-native Secret references but centralized storage.
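Option 2 in manifest form: a SecretProviderClass for the Azure provider (the vault name, tenant ID, client ID, and secret name are placeholders):

```yaml
# Pods mounting this class via the CSI driver get db-password as a file
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-kv
spec:
  provider: azure
  parameters:
    clientID: "<workload-identity-client-id>"
    keyvaultName: my-kv
    tenantId: "<tenant-id>"
    objects: |
      array:
        - |
          objectName: db-password
          objectType: secret
```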

Q29: What is Pod Security Admission (PSA)?

Answer: PSA replaced PodSecurityPolicies (removed in K8s 1.25). It enforces security standards at the namespace level using three profiles: Privileged (no restrictions), Baseline (blocks known escalation paths), and Restricted (hardened — no root, no hostPath, drop all capabilities). You set it with namespace labels: pod-security.kubernetes.io/enforce: restricted.
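Setting it in a manifest rather than with `kubectl label` (the namespace name is an example):

```yaml
# Enforce the restricted profile; warnings surface violations during kubectl apply
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```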

Q30: How do you restrict egress traffic from AKS pods?

Answer: Two layers:

  1. Pod level (NetworkPolicy): Define egress rules so pods can reach only approved destinations and ports. This is L3/L4 only; the built-in engines don't do FQDN filtering.
  2. Network level (Azure Firewall): Set outboundType: userDefinedRouting and route all cluster egress through Azure Firewall, allowing only required FQDNs, including the ones AKS itself needs (MCR, Azure management endpoints).

Q31: What does Microsoft Defender for Containers do for AKS?

Answer: It provides: runtime threat detection (crypto mining, reverse shells), vulnerability scanning of container images in ACR, security recommendations for cluster configuration, and alerts for suspicious Kubernetes API calls. It's enabled at the subscription level and integrates with Azure Security Center.

Q32: Explain the principle of least privilege for AKS.

Answer: Apply it at every layer:

  1. Azure identities: Scope the cluster and kubelet identities to the minimum (e.g., AcrPull on one registry, network permissions only on the node subnet)
  2. Humans: Map Azure AD groups to namespace-scoped Roles instead of handing out cluster-admin; disable local accounts so every access is auditable through Azure AD
  3. Workloads: One Workload Identity per application with minimal Azure permissions, dedicated ServiceAccounts, and containers running as non-root with capabilities dropped under the Restricted PSA profile

🛠️ Section 5 — Operations & Troubleshooting (Questions 33–40)

Q33: How do you monitor an AKS cluster?

Answer: Three layers:

  1. Infrastructure: Azure Monitor Container Insights for node and pod CPU/memory/disk, kubelet health, and live container logs
  2. Metrics and alerting: Managed Prometheus (or self-hosted Prometheus) scraping cluster and application metrics, dashboards in Managed Grafana, alert rules on error rates and saturation
  3. Logs and audit: Container stdout/stderr plus control plane diagnostic logs (kube-audit, kube-apiserver) sent to a Log Analytics workspace and queried with KQL

Q34: A pod is stuck in Pending — walk me through how you'd debug it.

Answer:

  1. kubectl describe pod <name> → read the Events section
  2. If "Insufficient cpu/memory" → kubectl top nodes to check capacity → scale the node pool or reduce resource requests
  3. If "no nodes match node affinity" → check nodeSelector/affinity rules vs actual node labels
  4. If "0/N nodes available: N had taint" → check pod tolerations vs node taints
  5. If PVC-related → kubectl get pvc to check if the volume is bound

Q35: What's the difference between liveness and readiness probes?

Answer:

| Probe | Purpose | On Failure |
|---|---|---|
| Liveness | Is the container alive? | Kubelet restarts the container |
| Readiness | Is the container ready to serve traffic? | Removed from Service endpoints (no traffic routed) |
| Startup | Has the container finished starting? | Restarts the container; while it runs, liveness/readiness checks are held off for slow-starting apps |

Common mistake: Using the same endpoint for liveness and readiness. If your app temporarily can't reach the database, readiness should fail (stop traffic), but liveness should still pass (don't restart — the DB will come back).
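A container fragment showing that split-endpoint pattern (paths, ports, and image are illustrative; `/ready` is assumed to check dependencies, `/healthz` only process health):

```yaml
# Deployment pod-template fragment: separate liveness and readiness endpoints
containers:
  - name: api
    image: myregistry.azurecr.io/api:1.0
    ports:
      - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30   # up to 60s of startup grace before liveness kicks in
      periodSeconds: 2
```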

Q36: How do you do zero-downtime deployments on AKS?

Answer: Combine these practices:

  1. RollingUpdate strategy with maxUnavailable: 0 and maxSurge: 1, so a new pod is Ready before an old one is removed
  2. Readiness probes, so the Service only routes traffic to pods that can actually serve it
  3. PodDisruptionBudgets, so node drains and upgrades never take down all replicas at once
  4. preStop hooks plus an adequate terminationGracePeriodSeconds, so in-flight requests drain before a pod is killed

Q37: What is GitOps and how is it used with AKS?

Answer: GitOps means Git is the single source of truth for cluster state. A GitOps operator running inside the cluster (Flux v2 or ArgoCD) continuously reconciles the desired state (Git repo) with the actual state (cluster). Benefits: audit trail (git log), rollback (git revert), no direct kubectl access needed in production. AKS has built-in Flux v2 support via the microsoft.flux extension.
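A hedged example of enabling that built-in support (the repo URL and paths are placeholders; requires the `k8s-configuration` CLI extension):

```shell
# Install Flux on the cluster and reconcile ./apps from a Git repo
az k8s-configuration flux create \
  --resource-group myRG --cluster-name myCluster --cluster-type managedClusters \
  --name cluster-config --namespace flux-system \
  --url https://github.com/myorg/aks-config --branch main \
  --kustomization name=apps path=./apps prune=true
```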

Q38: How do you handle multi-environment (dev/staging/prod) on AKS?

Answer: Two patterns:

  1. Cluster per environment: Separate AKS clusters for dev, staging, and prod. Strongest isolation and blast-radius control, at the cost of more clusters to operate and pay for.
  2. Shared cluster with namespaces: One cluster with environments isolated by RBAC, ResourceQuotas, and NetworkPolicies. Cheaper and simpler, but weaker isolation. A common compromise: shared cluster for dev/staging, dedicated cluster for prod.

With either pattern, use Helm values files per environment (values-dev.yaml, values-prod.yaml) and GitOps for automated deployment.

Q39: How would you migrate a workload from a VM to AKS?

Answer: Step-by-step approach:

  1. Containerize: Create a Dockerfile, ensure the app can run as a non-root container
  2. Externalize config: Move from files/env on the VM to ConfigMaps and Secrets
  3. Externalize state: Move persistent data to Azure managed services (Azure SQL, Redis Cache, Storage)
  4. CI/CD: Build pipeline to build image → push to ACR → deploy to AKS
  5. Gradual cutover: Run both VM and AKS versions, shift traffic with Azure Traffic Manager or Front Door
  6. Decommission: Once validated, shut down the VM

Q40: What would you check if the AKS API server is slow or unresponsive?

Answer:

  1. Check Azure status: status.azure.com for regional outages
  2. Check SLA tier: Free tier has no SLA — the API server can be slow under load. Upgrade to Standard.
  3. Check authorized IP ranges: Your IP might not be in the allowed list
  4. Check for chatty workloads: Excessive watch/list calls from controllers can overload the API server. Check with kubectl get --raw /metrics | grep apiserver_request_total
  5. Check cluster size: Very large clusters (1000+ nodes) may need Premium tier
  6. Run diagnostics: az aks show to check provisioning state

🎯 Bonus: Scenario-Based Questions

Scenario 1: "Design an AKS architecture for a multi-team e-commerce platform."

Framework for answering:

  1. Isolation: Namespace per team with RBAC (mapped to Azure AD groups), ResourceQuotas, and NetworkPolicies; production on its own cluster
  2. Node pools: Dedicated system pool; user pools per workload profile (general-purpose, Spot for batch, GPU if needed)
  3. Traffic: One Ingress Controller (NGINX, or AGIC with WAF) in front, TLS termination, internal services on ClusterIP
  4. Scaling: HPA plus Cluster Autoscaler; KEDA for queue-driven workers
  5. Security: Workload Identity, Key Vault CSI driver, private cluster or authorized IP ranges, Defender for Containers
  6. Operations: GitOps (Flux/ArgoCD) per team, Container Insights plus Managed Prometheus, cost tracking per namespace

Scenario 2: "Your deployment succeeded but the app returns 500 errors."

Debugging flow:

  1. Check pod logs → kubectl logs <pod> -n <ns>
  2. Is it all pods or just one? → If one, exec in and check
  3. Check dependent services (DB, Redis, external APIs) → kubectl exec + curl
  4. Check if ConfigMaps/Secrets have the right values → kubectl get cm/secret -o yaml
  5. Check if the new image version has a bug → helm rollback to previous version
  6. If rollback fixes it → the issue is in the code, not the infrastructure

📝 Interview Tips

  1. Answer the "why", not just the "what": not just "we use Azure CNI" but why pods need routable VNet IPs.
  2. Anchor answers in production experience; mention PDBs, SLA tiers, and cost trade-offs unprompted.
  3. If you don't know something, reason out loud from Kubernetes fundamentals. Interviewers value a debugging mindset over memorized trivia.

💡
Congratulations!

You've completed the entire AKS — Zero to Hero course! You now have the knowledge to create, manage, secure, scale, and troubleshoot AKS clusters in production. Review the hands-on labs regularly and practice on a real Azure subscription.
