Advanced Lesson 9 of 14

Deploy to AKS

Build Docker images, push to Azure Container Registry, and deploy to Azure Kubernetes Service using Helm β€” the full production pipeline.

πŸ§’ Simple Explanation (ELI5)

Imagine you run a delivery company. The Docker image is a sealed box, Azure Container Registry is the warehouse that stores it, the GitHub Actions workflow is the delivery truck, the Helm chart is the delivery instructions, and the AKS cluster is the destination address.

Put it all together: code is merged β†’ the truck picks up the box from the warehouse β†’ drives it to the address using the delivery instructions β†’ your app is live. No human intervention needed.

πŸ”§ Pipeline Overview

The end-to-end deployment pipeline follows a clear, linear flow:

```text
Code Push β†’ Lint & Test β†’ Build Docker Image β†’ Push to ACR β†’ Helm Deploy β†’ AKS Cluster
                                                                    ↓
                                                              Staging β†’ Smoke Tests β†’ Production (with approval)
```

Prerequisites

πŸ’‘
Cross-Reference

Already know AKS? See our AKS Course for cluster management deep-dives. Need Helm basics? See our Helm Course for charts, templating, and values.

πŸ” Azure Authentication

Before your workflow can push images or deploy to AKS, it must authenticate with Azure. There are two primary methods:

Method 1: Service Principal (Client Secret)

Create a Service Principal and store its credentials as GitHub secrets:

```yaml
- uses: azure/login@v2
  with:
    creds: |
      {
        "clientId": "${{ secrets.AZURE_CLIENT_ID }}",
        "clientSecret": "${{ secrets.AZURE_CLIENT_SECRET }}",
        "tenantId": "${{ secrets.AZURE_TENANT_ID }}",
        "subscriptionId": "${{ secrets.AZURE_SUBSCRIPTION_ID }}"
      }
```

Method 2: OIDC Federated Credentials (Recommended)

OIDC eliminates stored secrets entirely. GitHub's workflow token is exchanged for an Azure access token using a trust relationship configured in Microsoft Entra ID (formerly Azure AD).

```yaml
permissions:
  id-token: write   # Required for OIDC
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
```

Getting AKS Credentials

After Azure login, fetch the kubeconfig so kubectl and helm can talk to your cluster:

```yaml
- run: az aks get-credentials --resource-group ${{ vars.RESOURCE_GROUP }} --cluster-name ${{ vars.CLUSTER_NAME }} --overwrite-existing
```
⚠️
OIDC is Strongly Recommended

Service Principal secrets can be leaked, forgotten, or expire. OIDC federated credentials are short-lived, automatically scoped to the exact workflow, and never stored in GitHub. Use OIDC for all new setups unless your environment can't support federated credentials.

🐳 Build & Push to ACR

After authenticating, build the Docker image and push it to Azure Container Registry. Tag with both the commit SHA (immutable, traceable) and latest (convenience).

```yaml
- uses: azure/docker-login@v1
  with:
    login-server: ${{ vars.ACR_NAME }}.azurecr.io
    username: ${{ secrets.ACR_USERNAME }}
    password: ${{ secrets.ACR_PASSWORD }}

- uses: docker/build-push-action@v5
  with:
    push: true
    tags: |
      ${{ vars.ACR_NAME }}.azurecr.io/myapp:${{ github.sha }}
      ${{ vars.ACR_NAME }}.azurecr.io/myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
```
πŸ’‘
Always Tag with Commit SHA

Using ${{ github.sha }} as the primary tag creates an immutable, auditable link between your Git commit and the deployed image. You can always trace exactly which code is running in production. The latest tag is a convenience alias β€” never rely on it for production deployments.
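If you'd rather not hand-write the tag list, the docker/metadata-action can generate SHA and latest tags for you. A sketch (the `prefix=` attribute strips the default `sha-` prefix so the tag is the bare commit SHA; image name and registry are the same illustrative values as above):

```yaml
- uses: docker/metadata-action@v5
  id: meta
  with:
    images: ${{ vars.ACR_NAME }}.azurecr.io/myapp
    tags: |
      type=sha,prefix=,format=long                    # full commit SHA, no prefix
      type=raw,value=latest,enable={{is_default_branch}}  # latest only on main

- uses: docker/build-push-action@v5
  with:
    push: true
    tags: ${{ steps.meta.outputs.tags }}
    labels: ${{ steps.meta.outputs.labels }}   # adds OCI source/revision labels
```

The generated OCI labels also record the source repo and revision inside the image metadata, which helps traceability beyond the tag itself.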

⎈ Helm Deploy to AKS

With the image in ACR and kubeconfig configured, use Helm to deploy (or upgrade) the application on AKS:

```yaml
- uses: azure/aks-set-context@v3
  with:
    resource-group: ${{ vars.RESOURCE_GROUP }}
    cluster-name: ${{ vars.CLUSTER_NAME }}

- run: |
    helm upgrade --install myapp ./charts/myapp \
      --namespace production \
      --create-namespace \
      --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
      --set image.tag=${{ github.sha }} \
      --set ingress.host=myapp.example.com \
      --wait --timeout 5m
```
| Flag | Purpose |
| --- | --- |
| `upgrade --install` | Idempotent β€” installs if new, upgrades if already exists |
| `--namespace production` | Deploy into a dedicated namespace for isolation |
| `--create-namespace` | Auto-create the namespace if it doesn't exist yet |
| `--set image.tag=$SHA` | Pin to the exact commit's Docker image |
| `--wait` | Block until all pods are Ready β€” catches crash loops early |
| `--timeout 5m` | Fail the step if pods aren't ready within 5 minutes |
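For reference, the chart defaults that those `--set` flags override might look roughly like this (a sketch β€” key names mirror the flags above; values are illustrative):

```yaml
# charts/myapp/values.yaml β€” illustrative defaults, overridden at deploy time
replicaCount: 3
image:
  repository: myregistry.azurecr.io/myapp   # overridden via --set image.repository
  tag: latest                               # always overridden with the commit SHA
ingress:
  host: myapp.example.com                   # per-environment via --set ingress.host
```

Keeping safe defaults in `values.yaml` and overriding only what varies per deploy keeps the pipeline's Helm command short and the chart reusable.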

πŸ“‹ Complete Production Workflow

Here's a full workflow combining lint, test, build, and multi-stage deployment with environments and manual approval:

```yaml
# .github/workflows/deploy-aks.yml
name: Build & Deploy to AKS
on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm test

  build-and-push:
    needs: lint-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - uses: azure/docker-login@v1
        with:
          login-server: ${{ vars.ACR_NAME }}.azurecr.io
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}

      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ vars.ACR_NAME }}.azurecr.io/myapp:${{ github.sha }}
            ${{ vars.ACR_NAME }}.azurecr.io/myapp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}

      - uses: azure/setup-helm@v3
        with:
          version: v3.14.0

      - run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace staging \
            --create-namespace \
            --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
            --set image.tag=${{ github.sha }} \
            --set ingress.host=staging.myapp.example.com \
            --atomic --wait --timeout 5m

  smoke-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - run: |
          echo "Running smoke tests against staging..."
          for i in 1 2 3 4 5; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://staging.myapp.example.com/health)
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed (attempt $i)"
              exit 0
            fi
            echo "Attempt $i: got $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Smoke test failed after 5 attempts"
          exit 1

  deploy-production:
    needs: smoke-test
    runs-on: ubuntu-latest
    environment: production   # Requires manual approval
    steps:
      - uses: actions/checkout@v4

      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}

      - uses: azure/setup-helm@v3
        with:
          version: v3.14.0

      - run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace production \
            --create-namespace \
            --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
            --set image.tag=${{ github.sha }} \
            --set ingress.host=myapp.example.com \
            --atomic --wait --timeout 5m
```

πŸ”„ Deployment Strategies

Production deployments need more than just helm upgrade. Choose the right strategy based on your risk tolerance and traffic patterns.

Strategy 1: Rolling Update (Default)

Kubernetes' default Deployment strategy β€” new pods start while old pods terminate, giving zero downtime when configured correctly.

```yaml
# In your Helm chart's deployment.yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # Never kill old pods before new ones are ready
      maxSurge: 1              # Add one extra pod during rollout
  template:
    spec:
      containers:
        - name: myapp
          readinessProbe:        # CRITICAL β€” traffic only routes when ready
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          lifecycle:
            preStop:             # Allow in-flight requests to finish
              exec:
                command: ["sleep", "15"]
```
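A PodDisruptionBudget complements these settings by limiting voluntary evictions (node drains, cluster upgrades) so a rollout never coincides with too many pods going away at once. A minimal sketch, assuming the chart labels pods with `app: myapp`:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  minAvailable: 2          # keep at least 2 pods serving during disruptions
  selector:
    matchLabels:
      app: myapp
```

Set `minAvailable` relative to your replica count β€” with 3 replicas, `minAvailable: 2` tolerates one voluntary eviction at a time.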

Strategy 2: Blue-Green Deployment

Two identical environments β€” deploy to the inactive one, switch traffic instantly, keep the old one for rollback.

```yaml
# GitHub Actions workflow β€” Blue-Green via Helm
  deploy-green:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}

      # Step 1: Deploy to GREEN slot (inactive)
      - name: Deploy to green slot
        run: |
          helm upgrade --install myapp-green ./charts/myapp \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --set slot=green \
            --set service.enabled=false \
            --atomic --wait --timeout 5m

      # Step 2: Smoke test green slot via port-forward or internal service
      - name: Smoke test green
        run: |
          kubectl port-forward svc/myapp-green 8080:80 -n production &
          PF_PID=$!    # job control (%1) isn't available in CI shells β€” track the PID
          sleep 5
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
          kill $PF_PID
          if [ "$STATUS" != "200" ]; then
            echo "::error::Green slot health check failed (HTTP $STATUS)"
            helm uninstall myapp-green -n production
            exit 1
          fi
          echo "Green slot is healthy βœ…"

      # Step 3: Switch traffic from blue β†’ green
      - name: Switch traffic to green
        run: |
          kubectl patch service myapp -n production \
            -p '{"spec":{"selector":{"slot":"green"}}}'
          echo "Traffic switched to green βœ…"

      # Step 4: Keep blue as rollback target (scale down after 30 min)
      - name: Schedule blue cleanup
        run: |
          echo "Blue slot kept for rollback. Remove manually after validation:"
          echo "  helm uninstall myapp-blue -n production"
```
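The `kubectl patch` in step 3 assumes the live Service selects pods by a `slot` label. The underlying Service might look roughly like this (a sketch β€” port numbers are illustrative and match the readiness-probe port used earlier):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: production
spec:
  selector:
    app: myapp
    slot: blue        # patched to "green" at cutover; patch back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 3000
```

Because the selector is the only thing that changes, the cutover (and any rollback) is a single, near-instant API call with no pod restarts.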

Strategy 3: Canary Deployment

Route a small percentage of traffic to the new version, monitor metrics, then promote or rollback.

```yaml
  canary-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}

      # Deploy canary with 1 replica (stable has 5)
      - name: Deploy canary
        run: |
          helm upgrade --install myapp-canary ./charts/myapp \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --set replicaCount=1 \
            --set canary=true \
            --atomic --wait --timeout 5m

      # Monitor error rate for 5 minutes
      - name: Monitor canary health
        run: |
          echo "Monitoring canary for 5 minutes..."
          for i in $(seq 1 30); do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
              https://myapp.example.com/health)
            ERRORS=$(kubectl logs -l app=myapp,canary=true \
              -n production --tail=50 | grep -c "ERROR" || true)
            echo "Check $i/30 β€” HTTP: $STATUS, Errors: $ERRORS"
            if [ "$ERRORS" -gt 5 ]; then
              echo "::error::Canary error rate too high β€” rolling back"
              helm uninstall myapp-canary -n production
              exit 1
            fi
            sleep 10
          done
          echo "Canary healthy after 5 minutes βœ…"

      # Promote: update stable release to new version
      - name: Promote to stable
        run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace production \
            --set image.tag=${{ github.sha }} \
            --set replicaCount=5 \
            --atomic --wait --timeout 5m
          # Remove canary
          helm uninstall myapp-canary -n production
```
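The replica-ratio approach above gives roughly 1-in-6 traffic to the canary. For precise weighted splitting, ingress-nginx supports canary annotations β€” a sketch, assuming ingress-nginx is your ingress controller and `myapp-canary` is the canary's Service:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # route 5% of traffic here
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 80
```

Promotion then becomes a matter of raising `canary-weight` in steps (5 β†’ 25 β†’ 50 β†’ 100) rather than scaling replicas, independent of pod counts.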

Rollback Strategies

| Method | Command / Config | When to Use |
| --- | --- | --- |
| Automatic (`--atomic`) | `helm upgrade --atomic` | Failed deploy auto-reverts to the previous release |
| Manual Helm rollback | `helm rollback myapp <revision>` | Issue discovered after the workflow completes |
| Re-deploy previous SHA | Re-run workflow with the old commit's tag | When you know exactly which version was good |
| Blue-green switch-back | `kubectl patch svc …` selector `slot: blue` | Instant traffic switch to the previous environment |
| GitOps revert | `git revert <bad-commit>` β†’ pipeline redeploys | When using a GitOps workflow β€” Git is the source of truth |
⚠️
Always Test Rollbacks

A rollback strategy that has never been tested is not a rollback strategy. Periodically run helm rollback in staging to verify it works β€” check that database migrations are backward-compatible and that no breaking config changes prevent the old version from starting.
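Inside a workflow, an explicit fallback step can complement `--atomic` for failures detected after the Helm step itself succeeds (e.g. a post-deploy smoke test). A sketch β€” `scripts/smoke-test.sh` is a hypothetical placeholder for your own check:

```yaml
- name: Smoke test production
  run: ./scripts/smoke-test.sh   # hypothetical post-deploy check

- name: Roll back on failure
  if: failure()
  run: |
    # revision 0 tells Helm to roll back to the immediately previous release
    helm rollback myapp 0 --namespace production --wait
```

`helm history myapp -n production` shows the revision list if you need to target a specific known-good release instead of the previous one.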

Strategy Comparison

| Strategy | Downtime Risk | Rollback Speed | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Rolling Update | None (if probes set) | ~2–5 min | Low | Most applications |
| Blue-Green | None | Instant | Medium | Mission-critical services |
| Canary | Partial (% of traffic) | Fast (uninstall canary) | High | High-traffic services, risk mitigation |

πŸ“Š Deployment Flow Diagram

```text
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Git Push │───▢│ Lint & Test  │───▢│ Docker Build │───▢│ ACR Push β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜
                                                             β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β–Ό
            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
            β”‚ Helm Deploy  │───▢│ Smoke Tests  │───▢│  Manual Approval β”‚
            β”‚  (Staging)   β”‚    β”‚  (Staging)   β”‚    β”‚                  β”‚
            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                                             β”‚
                                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                              β–Ό              β–Ό              β–Ό
                                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                      β”‚  Rolling   β”‚ β”‚ Blue-Green β”‚ β”‚  Canary    β”‚
                                      β”‚  Update    β”‚ β”‚ Switch     β”‚ β”‚  β†’ Promote β”‚
                                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

πŸ› οΈ Hands-on Lab

Lab 1: Configure Azure Credentials

  1. Create a Service Principal: az ad sp create-for-rbac --name "github-actions-sp" --role contributor --scopes /subscriptions/<SUB_ID>
  2. Store the output values as GitHub secrets: AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID
  3. Add repository variables: ACR_NAME, RESOURCE_GROUP, CLUSTER_NAME
  4. Test authentication by adding azure/login@v2 + az account show to a test workflow
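Step 4's throwaway test workflow could look roughly like this (a sketch using the Service Principal method from this lab; the filename is arbitrary):

```yaml
# .github/workflows/test-azure-auth.yml β€” Lab 1, step 4
name: Test Azure Auth
on: workflow_dispatch   # trigger manually from the Actions tab

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: azure/login@v2
        with:
          creds: |
            {
              "clientId": "${{ secrets.AZURE_CLIENT_ID }}",
              "clientSecret": "${{ secrets.AZURE_CLIENT_SECRET }}",
              "tenantId": "${{ secrets.AZURE_TENANT_ID }}",
              "subscriptionId": "${{ secrets.AZURE_SUBSCRIPTION_ID }}"
            }
      # If this prints your subscription, authentication works
      - run: az account show --output table
```

Delete the workflow (or keep it around as a credential health check) once the main pipeline is in place.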

Lab 2: Build & Push to ACR

  1. Create a simple Dockerfile in your repository (Node.js, Python, or any runtime)
  2. Add a workflow that builds the image and pushes to ACR with ${{ github.sha }} tag
  3. Trigger the workflow and verify the image appears in ACR: az acr repository show-tags --name <ACR_NAME> --repository myapp
  4. Pull the image locally and run it to confirm it works: docker run <ACR_NAME>.azurecr.io/myapp:<SHA>

Lab 3: Deploy to AKS with Helm

  1. Create a basic Helm chart: helm create charts/myapp
  2. Update values.yaml to use your ACR image and ingress settings
  3. Add the Helm deploy step to your workflow, targeting a staging namespace
  4. Push to main and watch the deployment in the Actions tab
  5. Verify pods are running: kubectl get pods -n staging

Lab 4: Smoke Tests Post-Deploy

  1. Add a smoke test job that runs after staging deployment
  2. Curl the health endpoint and assert a 200 response
  3. Configure the production environment with required reviewers in GitHub Settings β†’ Environments
  4. Push a change and verify the workflow pauses at the production gate until approved

πŸ› Debugging Common Issues

The most common failures:

- "ACR login failed"
- "Helm deploy timeout"
- "az aks get-credentials failed"
- "ImagePullBackOff" β€” AKS can't pull from ACR
- "OIDC token request failed"

Deployment Debugging Decision Tree

```text
Helm deploy failed?
β”œβ”€β”€ "ACR login failed"
β”‚   β”œβ”€β”€ Credentials wrong β†’ Check ACR_USERNAME/PASSWORD or OIDC config
β”‚   β”œβ”€β”€ SP expired β†’ az ad sp credential reset
β”‚   └── Firewall blocking β†’ Whitelist GitHub runner IPs or use self-hosted
β”‚
β”œβ”€β”€ "UPGRADE FAILED" / timeout
β”‚   β”œβ”€β”€ kubectl get pods β†’ ImagePullBackOff?
β”‚   β”‚   β”œβ”€β”€ Image tag missing β†’ Check ACR tags
β”‚   β”‚   └── ACR not attached β†’ az aks update --attach-acr
β”‚   β”œβ”€β”€ kubectl get pods β†’ CrashLoopBackOff?
β”‚   β”‚   β”œβ”€β”€ kubectl logs <pod> β†’ App startup error
β”‚   β”‚   └── kubectl describe pod β†’ Missing env vars, bad config
β”‚   β”œβ”€β”€ kubectl get pods β†’ Pending?
β”‚   β”‚   └── kubectl describe pod β†’ Resource quota exceeded / no nodes
β”‚   └── kubectl get pods β†’ Running but not Ready?
β”‚       └── Readiness probe failing β†’ Check /health endpoint
β”‚
β”œβ”€β”€ "OIDC token request failed"
β”‚   β”œβ”€β”€ Missing permissions: id-token: write
β”‚   β”œβ”€β”€ Subject claim mismatch β†’ Check federated credential
β”‚   └── Audience mismatch β†’ Verify api://AzureADTokenExchange
β”‚
└── Workflow succeeded but app shows old version
    β”œβ”€β”€ Wrong namespace β†’ Check --namespace flag
    β”œβ”€β”€ Image tag not updated β†’ Check --set image.tag value
    └── Cached image β†’ Check imagePullPolicy: Always
```

🎯 Interview Questions

Basic (5)

1. What is Azure Container Registry (ACR) and why is it used in CI/CD?

ACR is a managed Docker container registry hosted in Azure. In CI/CD, it serves as the central storage for Docker images built during the pipeline. It integrates natively with AKS, supports geo-replication, vulnerability scanning, and RBAC β€” making it the natural choice for Azure-based Kubernetes deployments.

2. What does helm upgrade --install do?

It's an idempotent deploy command. If the release doesn't exist yet, it performs helm install. If it already exists, it performs helm upgrade. This means the same command works for both first-time deployments and updates, which is ideal for CI/CD where you don't want to track release state.

3. Why tag Docker images with the Git commit SHA?

The commit SHA creates an immutable, one-to-one link between the source code and the deployed image. You can always trace exactly which code is running in any environment. Unlike latest or version tags, the SHA never changes β€” two different commits can never produce the same tag.

4. What are GitHub Environments used for in deployment workflows?

Environments define deployment targets (staging, production) with protection rules. You can require manual approval, restrict which branches can deploy, add wait timers, and scope secrets/variables per environment. This creates a controlled promotion path from staging to production.

5. What is the --wait flag in Helm and why is it important in CI/CD?

The --wait flag tells Helm to block until all deployed resources (pods, services, etc.) reach a Ready state. In CI/CD, this is critical because without it, the workflow would report success immediately even if pods are crash-looping. Combined with --timeout, it ensures the pipeline fails fast if the deployment is broken.

Intermediate (5)

6. Compare Service Principal vs OIDC authentication for GitHub Actions to Azure.

Service Principal uses a client secret stored as a GitHub secret β€” it works universally but the secret can leak, must be rotated, and is long-lived. OIDC (OpenID Connect) uses GitHub's built-in token, exchanged for an Azure access token via a trust relationship. No secret is stored, tokens are short-lived and scoped to the workflow run. OIDC is more secure and recommended for all new setups.

7. How do you perform a Helm rollback in a CI/CD pipeline when a deployment fails?

Use the --atomic flag in your helm upgrade command. If the upgrade fails (pods don't become ready within the timeout), Helm automatically rolls back to the previous release. Alternatively, add a failure handler step that runs helm rollback <release> <revision> using if: failure(). You can get the previous revision number from helm history.

8. Explain the role of the azure/aks-set-context action.

This action configures the kubeconfig for the workflow so that subsequent kubectl and helm commands target the correct AKS cluster. It fetches cluster credentials using the authenticated Azure session and sets the KUBECONFIG environment variable. Without it, Helm would have no cluster to deploy to.

9. How would you deploy the same Helm chart to multiple environments with different configurations?

Use environment-specific values files (values-staging.yaml, values-production.yaml) and pass them via --values. Combine with GitHub Environments to scope variables per environment β€” vars.INGRESS_HOST in staging resolves to staging.myapp.com, in production to myapp.com. The chart stays identical; only the values change.
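In workflow terms, that might look roughly like this (a sketch β€” `ENV_NAME` is an assumed environment-scoped repository variable resolving to `staging` or `production`, and the per-environment values files are assumed to exist in the chart):

```yaml
- run: |
    helm upgrade --install myapp ./charts/myapp \
      --namespace ${{ vars.ENV_NAME }} \
      --values charts/myapp/values-${{ vars.ENV_NAME }}.yaml \
      --set image.tag=${{ github.sha }} \
      --atomic --wait --timeout 5m
```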

10. What permissions does a Service Principal need for a GitHub Actions β†’ ACR β†’ AKS pipeline?

Minimum: AcrPush role on the ACR (to push images), Azure Kubernetes Service Cluster User Role on the AKS cluster (to get credentials and deploy), and Reader on the resource group. For OIDC, you also need a Federated Credential configured on the App Registration pointing to your GitHub repo and branch.

Senior (5)

11. Design a blue-green deployment strategy for AKS using GitHub Actions and Helm.

Maintain two namespaces (or label sets): blue (current live) and green (new version). The pipeline deploys to the inactive set using Helm, runs comprehensive smoke tests, then switches the ingress/service selector from blue to green. If smoke tests fail, no traffic shift occurs. After successful cutover, the old set is kept as an instant rollback target. Implement via Helm values: --set slot=green controls labels; a separate step updates the ingress annotation or service mesh routing rule.

12. How would you implement canary deployments to AKS from GitHub Actions?

Use a canary Helm release alongside the stable release. Deploy with --set replicaCount=1 for the canary version while keeping the stable release at full scale. Configure an ingress controller (like Nginx with canary annotations) or a service mesh (like Istio) to route a percentage of traffic (e.g., 5%) to the canary pods. Monitor error rates and latency in the smoke test job. If metrics look good, gradually increase the canary weight in subsequent workflow steps. If metrics degrade, run helm uninstall on the canary release.

13. A microservices team has 12 services deployed to AKS via GitHub Actions. How do you structure the pipelines?

Use a monorepo with path filters: each service triggers only when its directory changes (on.push.paths: ['services/auth/**']). Share a reusable workflow (.github/workflows/deploy-service.yml) that accepts inputs: service name, chart path, namespace. Each service's workflow calls the reusable one with its specific parameters. Use a matrix strategy for shared infrastructure components. Implement dependency ordering via needs: for services with startup dependencies. Centralize Helm chart templates in a shared library chart.
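The shared reusable workflow described above could be sketched like this (names, paths, and inputs are illustrative):

```yaml
# .github/workflows/deploy-service.yml β€” shared deploy logic (sketch)
name: Deploy Service
on:
  workflow_call:
    inputs:
      service:
        required: true
        type: string
      namespace:
        required: true
        type: string

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}
      - run: |
          helm upgrade --install ${{ inputs.service }} ./services/${{ inputs.service }}/chart \
            --namespace ${{ inputs.namespace }} \
            --set image.tag=${{ github.sha }} \
            --atomic --wait --timeout 5m
```

A per-service caller then needs only `uses: ./.github/workflows/deploy-service.yml` with `with: {service: auth, namespace: production}` and `secrets: inherit` so the reusable workflow can read the Azure secrets.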

14. Your OIDC authentication works in staging but fails in production. What do you investigate?

Check: (1) The Federated Credential's subject filter β€” it may be scoped to ref:refs/heads/main but production uses a different branch or tag. (2) Environment-scoped Federated Credentials may restrict which environments can authenticate. (3) The permissions: id-token: write must be declared at the job or workflow level. (4) The Azure App Registration may have Conditional Access policies that differ by environment. (5) Check GitHub's OIDC token claims using curl $ACTIONS_ID_TOKEN_REQUEST_URL to see exactly what's being sent.

15. How do you ensure zero-downtime deployments to AKS via Helm in CI/CD?

Multiple layers: (1) RollingUpdate strategy in the Deployment with maxUnavailable: 0 β€” new pods start before old ones terminate. (2) Proper readiness probes so traffic only routes to healthy pods. (3) preStop lifecycle hooks with a sleep to allow in-flight requests to complete before pod termination. (4) Pod Disruption Budgets to prevent too many pods going down simultaneously. (5) --atomic flag in Helm so a failed upgrade auto-rolls back. (6) Connection draining configured on the ingress controller. All of these are configured in the Helm chart's templates and values.

🏭 Real-World Scenario

A fintech startup runs 12 microservices on AKS, all deployed via GitHub Actions and Helm. Here's how their pipeline evolved:

Phase 1 β€” Manual deploys: Engineers ran helm upgrade from their laptops. Different developers had different kubeconfigs, sometimes deploying debug builds to production. Deployments were infrequent (weekly) because they were risky and time-consuming.

Phase 2 β€” Basic CI/CD: A single workflow built and deployed all 12 services on every push to main. Build times ballooned to 45 minutes. A bug in one service blocked deployment of all others.

Phase 3 β€” Optimized pipeline (current): per-service workflows triggered by path filters, a shared reusable deploy workflow, SHA-tagged images in ACR, OIDC authentication, and a staged rollout β€” staging, smoke tests, then an approved production deploy with --atomic.

Result: deployments went from weekly and risky to 20+ per day with zero downtime. Mean time to production dropped from 5 days to 15 minutes.

πŸ“ Summary
