Deploy to AKS
Build Docker images, push them to Azure Container Registry, and deploy to Azure Kubernetes Service using Helm: the full production pipeline.
Simple Explanation (ELI5)
Imagine you run a delivery company.
- Building the Docker image is like packing your product into a sturdy shipping box: everything the customer needs is sealed inside.
- Tagging the image is printing a shipping label with the exact version number so you always know which box is which.
- Pushing to ACR (Azure Container Registry) is dropping the box off at the warehouse. It's stored safely, ready to be picked up by any delivery truck.
- GitHub Actions is the delivery truck. Every time new code is merged, it automatically drives to the warehouse, picks up the latest box, and heads to the destination.
- AKS (Azure Kubernetes Service) is the customer's address: the live cluster where your application runs for real users.
- Helm is the delivery instructions taped to the box: it tells AKS exactly how to unpack, configure, and run your app.
Put it all together: code is merged → the truck picks up the box from the warehouse → drives it to the address using the delivery instructions → your app is live. No human intervention needed.
Pipeline Overview
The end-to-end deployment pipeline follows a clear, linear flow:
Code Push → Lint & Test → Build Docker Image → Push to ACR → Helm Deploy → AKS Cluster
                                                                ↓
                                Staging → Smoke Tests → Production (with approval)
Prerequisites
- AKS cluster: a running Kubernetes cluster in Azure
- ACR registry: an Azure Container Registry to store Docker images
- Authentication: a Service Principal or OIDC Federated Credential with permissions to push to ACR and deploy to AKS
- Helm chart: a chart in your repository (e.g., ./charts/myapp/) that describes your Kubernetes resources
Already know AKS? See our AKS Course for cluster management deep-dives. Need Helm basics? See our Helm Course for charts, templating, and values.
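For orientation, a minimal values.yaml consistent with the --set flags used later in this guide might look like this (a sketch; your chart's keys and defaults may differ):

```yaml
# ./charts/myapp/values.yaml: defaults overridden by --set in the pipeline
replicaCount: 2
image:
  repository: myregistry.azurecr.io/myapp   # overridden with your ACR name
  tag: latest                               # overridden with the commit SHA
ingress:
  host: myapp.example.com                   # overridden per environment
```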
Azure Authentication
Before your workflow can push images or deploy to AKS, it must authenticate with Azure. There are two primary methods:
Method 1: Service Principal (Client Secret)
Create a Service Principal and store its credentials as GitHub secrets:
- AZURE_CLIENT_ID: the app (client) ID
- AZURE_CLIENT_SECRET: the client secret
- AZURE_TENANT_ID: your Azure AD tenant ID
- AZURE_SUBSCRIPTION_ID: the target subscription ID
```yaml
- uses: azure/login@v2
  with:
    creds: |
      {
        "clientId": "${{ secrets.AZURE_CLIENT_ID }}",
        "clientSecret": "${{ secrets.AZURE_CLIENT_SECRET }}",
        "tenantId": "${{ secrets.AZURE_TENANT_ID }}",
        "subscriptionId": "${{ secrets.AZURE_SUBSCRIPTION_ID }}"
      }
```
Method 2: OIDC Federated Credentials (Recommended)
OIDC eliminates stored secrets entirely. GitHub's token is exchanged for an Azure access token using a trust relationship configured in Azure AD.
- No client secret to rotate or leak
- Short-lived tokens scoped to the workflow run
- Requires configuring a Federated Credential on the App Registration
```yaml
permissions:
  id-token: write   # Required for OIDC
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
```
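Configuring the trust relationship is a one-time setup task. A sketch using the Azure CLI (the org, repo, and credential name are placeholders you must replace):

```shell
# One-time: register a federated credential on the App Registration so
# GitHub's OIDC token for pushes to main is trusted (names are examples).
az ad app federated-credential create \
  --id <APP_OBJECT_ID> \
  --parameters '{
    "name": "github-main",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:myorg/myrepo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```

For environment-scoped deployments, the subject takes the form repo:myorg/myrepo:environment:production instead.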
Getting AKS Credentials
After Azure login, fetch the kubeconfig so kubectl and helm can talk to your cluster:
```yaml
- run: az aks get-credentials --resource-group ${{ vars.RESOURCE_GROUP }} --cluster-name ${{ vars.CLUSTER_NAME }} --overwrite-existing
```
Service Principal secrets can leak, be forgotten, or expire. OIDC federated credentials are short-lived, automatically scoped to the exact workflow, and never stored in GitHub. Use OIDC for all new setups unless your Azure AD tenant doesn't support it.
Build & Push to ACR
After authenticating, build the Docker image and push it to Azure Container Registry. Tag with both the commit SHA (immutable, traceable) and latest (convenience).
```yaml
- uses: azure/docker-login@v1
  with:
    login-server: ${{ vars.ACR_NAME }}.azurecr.io
    username: ${{ secrets.ACR_USERNAME }}
    password: ${{ secrets.ACR_PASSWORD }}

# Buildx is required for the GitHub Actions (gha) cache backend below
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: |
      ${{ vars.ACR_NAME }}.azurecr.io/myapp:${{ github.sha }}
      ${{ vars.ACR_NAME }}.azurecr.io/myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
```
Using ${{ github.sha }} as the primary tag creates an immutable, auditable link between your Git commit and the deployed image. You can always trace exactly which code is running in production. The latest tag is a convenience alias; never rely on it for production deployments.
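If you'd rather not store ACR admin credentials at all, you can reuse the azure/login session instead of azure/docker-login. A sketch, assuming the authenticated identity holds the AcrPush role on the registry:

```yaml
- uses: azure/login@v2
  with:
    client-id: ${{ secrets.AZURE_CLIENT_ID }}
    tenant-id: ${{ secrets.AZURE_TENANT_ID }}
    subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

# az acr login wires Docker credentials from the Azure session,
# so no ACR_USERNAME/ACR_PASSWORD secrets are needed.
- run: az acr login --name ${{ vars.ACR_NAME }}
```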
Helm Deploy to AKS
With the image in ACR and kubeconfig configured, use Helm to deploy (or upgrade) the application on AKS:
```yaml
- uses: azure/aks-set-context@v3
  with:
    resource-group: ${{ vars.RESOURCE_GROUP }}
    cluster-name: ${{ vars.CLUSTER_NAME }}

- run: |
    helm upgrade --install myapp ./charts/myapp \
      --namespace production \
      --create-namespace \
      --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
      --set image.tag=${{ github.sha }} \
      --set ingress.host=myapp.example.com \
      --wait --timeout 5m
```
| Flag | Purpose |
|---|---|
| upgrade --install | Idempotent: installs if the release is new, upgrades if it already exists |
| --namespace production | Deploy into a dedicated namespace for isolation |
| --create-namespace | Auto-create the namespace if it doesn't exist yet |
| --set image.tag=$SHA | Pin to the exact commit's Docker image |
| --wait | Block until all pods are Ready, catching crash loops early |
| --timeout 5m | Fail the step if pods aren't ready within 5 minutes |
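If you prefer an explicit failure handler over relying solely on flags, a rollback step can be attached to the deploy job. A sketch (release and namespace as above; helm rollback with no revision argument reverts to the previous release):

```yaml
- name: Roll back on failed deploy
  if: failure()   # runs only when an earlier step in this job failed
  run: |
    # No revision argument: helm rollback targets the previous release
    helm rollback myapp --namespace production --wait --timeout 5m
```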
Complete Production Workflow
Here's a full workflow combining lint, test, build, and multi-stage deployment with environments and manual approval:
```yaml
# .github/workflows/deploy-aks.yml
name: Build & Deploy to AKS

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm test

  build-and-push:
    needs: lint-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/docker-login@v1
        with:
          login-server: ${{ vars.ACR_NAME }}.azurecr.io
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ vars.ACR_NAME }}.azurecr.io/myapp:${{ github.sha }}
            ${{ vars.ACR_NAME }}.azurecr.io/myapp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}
      - uses: azure/setup-helm@v3
        with:
          version: v3.14.0
      - run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace staging \
            --create-namespace \
            --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
            --set image.tag=${{ github.sha }} \
            --set ingress.host=staging.myapp.example.com \
            --atomic --wait --timeout 5m

  smoke-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - run: |
          echo "Running smoke tests against staging..."
          for i in 1 2 3 4 5; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://staging.myapp.example.com/health)
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed (attempt $i)"
              exit 0
            fi
            echo "Attempt $i: got $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Smoke test failed after 5 attempts"
          exit 1

  deploy-production:
    needs: smoke-test
    runs-on: ubuntu-latest
    environment: production   # Requires manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}
      - uses: azure/setup-helm@v3
        with:
          version: v3.14.0
      - run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace production \
            --create-namespace \
            --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
            --set image.tag=${{ github.sha }} \
            --set ingress.host=myapp.example.com \
            --atomic --wait --timeout 5m
```
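The inline smoke-test loop in the workflow can be factored into a small reusable helper. A standalone sketch (the probe command, attempt count, and delay are up to you):

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds, sleeping between failures.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "Check passed (attempt $i)"
      return 0
    fi
    echo "Attempt $i failed, retrying in ${delay}s..."
    sleep "$delay"
    i=$((i + 1))
  done
  echo "All $attempts attempts failed"
  return 1
}
```

In the smoke-test job, the step body then reduces to something like retry 5 10 curl -sf https://staging.myapp.example.com/health.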
Deployment Strategies
Production deployments need more than just helm upgrade. Choose the right strategy based on your risk tolerance and traffic patterns.
Strategy 1: Rolling Update (Default)
Helm's default: new pods start while old pods terminate. Zero downtime when configured correctly.
```yaml
# In your Helm chart's deployment.yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never kill old pods before new ones are ready
      maxSurge: 1         # Add one extra pod during rollout
  template:
    spec:
      containers:
        - name: myapp
          readinessProbe:   # CRITICAL: traffic only routes when ready
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          lifecycle:
            preStop:        # Allow in-flight requests to finish
              exec:
                command: ["sleep", "15"]
```
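Rolling updates pair well with a PodDisruptionBudget, which limits how many replicas voluntary disruptions (node drains, cluster upgrades) may take down at once. A sketch, assuming your pods carry the label app: myapp:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: production
spec:
  minAvailable: 2        # keep at least 2 pods running during node drains
  selector:
    matchLabels:
      app: myapp         # must match your Deployment's pod labels
```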
Strategy 2: Blue-Green Deployment
Two identical environments: deploy to the inactive one, switch traffic instantly, and keep the old one for rollback.
```yaml
# GitHub Actions workflow: blue-green via Helm
deploy-green:
  runs-on: ubuntu-latest
  environment: production
  steps:
    - uses: actions/checkout@v4
    - uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - uses: azure/aks-set-context@v3
      with:
        resource-group: ${{ vars.RESOURCE_GROUP }}
        cluster-name: ${{ vars.CLUSTER_NAME }}

    # Step 1: Deploy to GREEN slot (inactive)
    - name: Deploy to green slot
      run: |
        helm upgrade --install myapp-green ./charts/myapp \
          --namespace production \
          --set image.tag=${{ github.sha }} \
          --set slot=green \
          --set service.enabled=false \
          --atomic --wait --timeout 5m

    # Step 2: Smoke test green slot via port-forward or internal service
    - name: Smoke test green
      run: |
        kubectl port-forward svc/myapp-green 8080:80 -n production &
        sleep 5
        STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
        kill %1
        if [ "$STATUS" != "200" ]; then
          echo "::error::Green slot health check failed (HTTP $STATUS)"
          helm uninstall myapp-green -n production
          exit 1
        fi
        echo "Green slot is healthy"

    # Step 3: Switch traffic from blue to green
    - name: Switch traffic to green
      run: |
        kubectl patch service myapp -n production \
          -p '{"spec":{"selector":{"slot":"green"}}}'
        echo "Traffic switched to green"

    # Step 4: Keep blue as rollback target (scale down after 30 min)
    - name: Schedule blue cleanup
      run: |
        echo "Blue slot kept for rollback. Remove manually after validation:"
        echo "  helm uninstall myapp-blue -n production"
```
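If a problem surfaces after cutover, rollback is the same selector patch in reverse. A sketch of a manual switch-back step (slot labels as in the workflow above):

```yaml
- name: Roll back to blue
  run: |
    # Point the live service back at the blue pods; traffic shifts instantly
    kubectl patch service myapp -n production \
      -p '{"spec":{"selector":{"slot":"blue"}}}'
```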
Strategy 3: Canary Deployment
Route a small percentage of traffic to the new version, monitor metrics, then promote or rollback.
```yaml
canary-deploy:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - uses: azure/aks-set-context@v3
      with:
        resource-group: ${{ vars.RESOURCE_GROUP }}
        cluster-name: ${{ vars.CLUSTER_NAME }}

    # Deploy canary with 1 replica (stable has 5)
    - name: Deploy canary
      run: |
        helm upgrade --install myapp-canary ./charts/myapp \
          --namespace production \
          --set image.tag=${{ github.sha }} \
          --set replicaCount=1 \
          --set canary=true \
          --atomic --wait --timeout 5m

    # Monitor error rate for 5 minutes
    - name: Monitor canary health
      run: |
        echo "Monitoring canary for 5 minutes..."
        for i in $(seq 1 30); do
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            https://myapp.example.com/health)
          ERRORS=$(kubectl logs -l app=myapp,canary=true \
            -n production --tail=50 | grep -c "ERROR" || true)
          echo "Check $i/30: HTTP $STATUS, errors $ERRORS"
          if [ "$ERRORS" -gt 5 ]; then
            echo "::error::Canary error rate too high, rolling back"
            helm uninstall myapp-canary -n production
            exit 1
          fi
          sleep 10
        done
        echo "Canary healthy after 5 minutes"

    # Promote: update stable release to new version
    - name: Promote to stable
      run: |
        helm upgrade --install myapp ./charts/myapp \
          --namespace production \
          --set image.tag=${{ github.sha }} \
          --set replicaCount=5 \
          --atomic --wait --timeout 5m
        # Remove canary
        helm uninstall myapp-canary -n production
```
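Replica counts only give coarse traffic splitting. With the NGINX ingress controller you can declare an exact canary weight instead. A sketch, assuming a myapp-canary service on port 80 (names follow the example above):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # route ~5% of traffic here
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 80
```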
Rollback Strategies
| Method | Command / Config | When to Use |
|---|---|---|
| Automatic (--atomic) | helm upgrade --atomic | Failed deploy auto-reverts to previous release |
| Manual Helm rollback | helm rollback myapp <revision> | Post-deploy issue discovered after workflow completes |
| Re-deploy previous SHA | Re-run workflow with old commit's tag | When you know which exact version was good |
| Blue-green switch-back | kubectl patch svc … selector: slot: blue | Instant traffic switch to the previous environment |
| GitOps revert | git revert <bad-commit>, then the pipeline redeploys | GitOps workflows where Git is the source of truth |
A rollback strategy that has never been tested is not a rollback strategy. Periodically run helm rollback in staging to verify it works: check that database migrations are backward-compatible and that no breaking config changes prevent the old version from starting.
Strategy Comparison
| Strategy | Downtime Risk | Rollback Speed | Complexity | Best For |
|---|---|---|---|---|
| Rolling Update | None (if probes set) | ~2–5 min | Low | Most applications |
| Blue-Green | None | Instant | Medium | Mission-critical services |
| Canary | Partial (% of traffic) | Fast (uninstall canary) | High | High-traffic services, risk mitigation |
Deployment Flow Diagram
```
Git Push → Lint & Test → Docker Build → ACR Push
                                           │
                                           ▼
Helm Deploy (Staging) → Smoke Tests (Staging) → Manual Approval
                                           │
                                           ▼
     Rolling Update  /  Blue-Green Switch  /  Canary → Promote
```
Hands-on Lab
Lab 1: Configure Azure Credentials
- Create a Service Principal:
  az ad sp create-for-rbac --name "github-actions-sp" --role contributor --scopes /subscriptions/<SUB_ID>
- Store the output values as GitHub secrets: AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID
- Add repository variables: ACR_NAME, RESOURCE_GROUP, CLUSTER_NAME
- Test authentication by adding azure/login@v2 + az account show to a test workflow
Lab 2: Build & Push to ACR
- Create a simple Dockerfile in your repository (Node.js, Python, or any runtime)
- Add a workflow that builds the image and pushes to ACR with the ${{ github.sha }} tag
- Trigger the workflow and verify the image appears in ACR:
  az acr repository show-tags --name <ACR_NAME> --repository myapp
- Pull the image locally and run it to confirm it works:
  docker run <ACR_NAME>.azurecr.io/myapp:<SHA>
Lab 3: Deploy to AKS with Helm
- Create a basic Helm chart: helm create charts/myapp
- Update values.yaml to use your ACR image and ingress settings
- Add the Helm deploy step to your workflow, targeting a staging namespace
- Push to main and watch the deployment in the Actions tab
- Verify pods are running: kubectl get pods -n staging
Lab 4: Smoke Tests Post-Deploy
- Add a smoke test job that runs after staging deployment
- Curl the health endpoint and assert a 200 response
- Configure the production environment with required reviewers in GitHub Settings → Environments
- Push a change and verify the workflow pauses at the production gate until approved
Debugging Common Issues
"ACR login failed"
- Wrong credentials: Double-check the ACR_USERNAME and ACR_PASSWORD secrets. For admin-enabled ACR, the username is the registry name and the password comes from the ACR Access Keys blade
- Service Principal expired: SP secrets have a default expiry (1–2 years). Regenerate with az ad sp credential reset and update the GitHub secret
- Firewall rules: If ACR has network restrictions, GitHub-hosted runners may be blocked. Use a self-hosted runner inside the VNet or allow GitHub's IP ranges
"Helm deploy timeout"
- Pod crash loop: The container is starting and failing repeatedly. Check kubectl logs <pod> -n <namespace> for startup errors (missing env vars, bad config)
- Image pull error: AKS can't pull from ACR. Verify the ACR-AKS integration:
  az aks check-acr --name <CLUSTER> --resource-group <RG> --acr <ACR_NAME>.azurecr.io
- Resource limits: The pod requests more CPU/memory than the node can provide. Check kubectl describe pod <pod> for scheduling failures
- Readiness probe failing: The --wait flag waits for readiness probes. If your probe endpoint isn't implemented or returns errors, the deploy will time out
"az aks get-credentials failed"
- Wrong resource group or cluster name: Verify with az aks list -o table
- Insufficient permissions: The SP needs at minimum the Azure Kubernetes Service Cluster User Role on the AKS resource
- Cluster not running: If the cluster is stopped, get-credentials succeeds but kubectl commands fail. Start it with az aks start
"ImagePullBackOff": AKS Can't Pull from ACR
- ACR not attached to AKS: Run az aks check-acr --name <CLUSTER> --resource-group <RG> --acr <ACR_NAME>.azurecr.io. If not attached: az aks update -n <CLUSTER> -g <RG> --attach-acr <ACR_NAME>
- Image tag doesn't exist: Verify: az acr repository show-tags --name <ACR_NAME> --repository myapp -o table
- Private ACR with network rules: AKS nodes must have network access to ACR. Check firewall rules and private endpoint config
"OIDC token request failed"
- Missing id-token: write permission: OIDC requires it; add it to the workflow-level or job-level permissions: block
- Federated credential subject mismatch: The subject claim must match exactly. For environment-scoped deployments use repo:ORG/REPO:environment:production, not ref:refs/heads/main
- Audience mismatch: Azure expects api://AzureADTokenExchange as the audience. Verify your federated credential config
Deployment Debugging Decision Tree
```
Helm deploy failed?
├── "ACR login failed"
│   ├── Credentials wrong → Check ACR_USERNAME/PASSWORD or OIDC config
│   ├── SP expired → az ad sp credential reset
│   └── Firewall blocking → Whitelist GitHub runner IPs or use self-hosted
│
├── "UPGRADE FAILED" / timeout
│   ├── kubectl get pods → ImagePullBackOff?
│   │   ├── Image tag missing → Check ACR tags
│   │   └── ACR not attached → az aks update --attach-acr
│   ├── kubectl get pods → CrashLoopBackOff?
│   │   ├── kubectl logs <pod> → App startup error
│   │   └── kubectl describe pod → Missing env vars, bad config
│   ├── kubectl get pods → Pending?
│   │   └── kubectl describe pod → Resource quota exceeded / no nodes
│   └── kubectl get pods → Running but not Ready?
│       └── Readiness probe failing → Check /health endpoint
│
├── "OIDC token request failed"
│   ├── Missing permissions: id-token: write
│   ├── Subject claim mismatch → Check federated credential
│   └── Audience mismatch → Verify api://AzureADTokenExchange
│
└── Workflow succeeded but app shows old version
    ├── Wrong namespace → Check --namespace flag
    ├── Image tag not updated → Check --set image.tag value
    └── Cached image → Check imagePullPolicy: Always
```
Interview Questions
Basic (5)
1. What is Azure Container Registry (ACR) and why is it used in CI/CD?
ACR is a managed Docker container registry hosted in Azure. In CI/CD, it serves as the central storage for Docker images built during the pipeline. It integrates natively with AKS and supports geo-replication, vulnerability scanning, and RBAC, making it the natural choice for Azure-based Kubernetes deployments.
2. What does helm upgrade --install do?
It's an idempotent deploy command. If the release doesn't exist yet, it performs helm install. If it already exists, it performs helm upgrade. This means the same command works for both first-time deployments and updates, which is ideal for CI/CD where you don't want to track release state.
3. Why tag Docker images with the Git commit SHA?
The commit SHA creates an immutable, one-to-one link between the source code and the deployed image. You can always trace exactly which code is running in any environment. Unlike latest or version tags, the SHA never changes; two different commits can never produce the same tag.
4. What are GitHub Environments used for in deployment workflows?
Environments define deployment targets (staging, production) with protection rules. You can require manual approval, restrict which branches can deploy, add wait timers, and scope secrets/variables per environment. This creates a controlled promotion path from staging to production.
5. What is the --wait flag in Helm and why is it important in CI/CD?
The --wait flag tells Helm to block until all deployed resources (pods, services, etc.) reach a Ready state. In CI/CD, this is critical because without it, the workflow would report success immediately even if pods are crash-looping. Combined with --timeout, it ensures the pipeline fails fast if the deployment is broken.
Intermediate (5)
6. Compare Service Principal vs OIDC authentication for GitHub Actions to Azure.
Service Principal auth uses a client secret stored as a GitHub secret; it works universally, but the secret can leak, must be rotated, and is long-lived. OIDC (OpenID Connect) uses GitHub's built-in token, exchanged for an Azure access token via a trust relationship. No secret is stored, and tokens are short-lived and scoped to the workflow run. OIDC is more secure and recommended for all new setups.
7. How do you perform a Helm rollback in a CI/CD pipeline when a deployment fails?
Use the --atomic flag in your helm upgrade command. If the upgrade fails (pods don't become ready within the timeout), Helm automatically rolls back to the previous release. Alternatively, add a failure handler step that runs helm rollback <release> <revision> using if: failure(). You can get the previous revision number from helm history.
8. Explain the role of the azure/aks-set-context action.
This action configures the kubeconfig for the workflow so that subsequent kubectl and helm commands target the correct AKS cluster. It fetches cluster credentials using the authenticated Azure session and sets the KUBECONFIG environment variable. Without it, Helm would have no cluster to deploy to.
9. How would you deploy the same Helm chart to multiple environments with different configurations?
Use environment-specific values files (values-staging.yaml, values-production.yaml) and pass them via --values. Combine with GitHub Environments to scope variables per environment: vars.INGRESS_HOST resolves to staging.myapp.com in staging and to myapp.com in production. The chart stays identical; only the values change.
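A sketch of the values-file layout this answer describes (file names are conventions, not requirements; keys mirror the chart used earlier):

```yaml
# values-staging.yaml: overrides applied on top of the shared values.yaml
replicaCount: 1
ingress:
  host: staging.myapp.example.com

# Deployed with:
#   helm upgrade --install myapp ./charts/myapp \
#     --namespace staging --values values-staging.yaml
```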
10. What permissions does a Service Principal need for a GitHub Actions β ACR β AKS pipeline?
Minimum: AcrPush role on the ACR (to push images), Azure Kubernetes Service Cluster User Role on the AKS cluster (to get credentials and deploy), and Reader on the resource group. For OIDC, you also need a Federated Credential configured on the App Registration pointing to your GitHub repo and branch.
Senior (5)
11. Design a blue-green deployment strategy for AKS using GitHub Actions and Helm.
Maintain two namespaces (or label sets): blue (current live) and green (new version). The pipeline deploys to the inactive set using Helm, runs comprehensive smoke tests, then switches the ingress/service selector from blue to green. If smoke tests fail, no traffic shift occurs. After successful cutover, the old set is kept as an instant rollback target. Implement via Helm values: --set slot=green controls labels; a separate step updates the ingress annotation or service mesh routing rule.
12. How would you implement canary deployments to AKS from GitHub Actions?
Use a canary Helm release alongside the stable release. Deploy with --set replicaCount=1 for the canary version while keeping the stable release at full scale. Configure an ingress controller (like Nginx with canary annotations) or a service mesh (like Istio) to route a percentage of traffic (e.g., 5%) to the canary pods. Monitor error rates and latency in the smoke test job. If metrics look good, gradually increase the canary weight in subsequent workflow steps. If metrics degrade, run helm uninstall on the canary release.
13. A microservices team has 12 services deployed to AKS via GitHub Actions. How do you structure the pipelines?
Use a monorepo with path filters: each service triggers only when its directory changes (on.push.paths: ['services/auth/**']). Share a reusable workflow (.github/workflows/deploy-service.yml) that accepts inputs: service name, chart path, namespace. Each service's workflow calls the reusable one with its specific parameters. Use a matrix strategy for shared infrastructure components. Implement dependency ordering via needs: for services with startup dependencies. Centralize Helm chart templates in a shared library chart.
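The caller side of that reusable-workflow pattern might look like this (service name, inputs, and paths are illustrative; the shared workflow must declare matching workflow_call inputs):

```yaml
# .github/workflows/deploy-auth.yml (one caller per service)
name: Deploy auth service
on:
  push:
    branches: [main]
    paths: ['services/auth/**']    # build only when this service changes

jobs:
  deploy:
    uses: ./.github/workflows/deploy-service.yml   # shared reusable workflow
    with:
      service: auth
      chart-path: ./charts/auth
      namespace: production
    secrets: inherit
```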
14. Your OIDC authentication works in staging but fails in production. What do you investigate?
Check: (1) The Federated Credential's subject filter: it may be scoped to ref:refs/heads/main while production deploys from a different branch, tag, or environment. (2) Environment-scoped Federated Credentials may restrict which environments can authenticate. (3) The permissions: id-token: write must be declared at the job or workflow level. (4) The Azure App Registration may have Conditional Access policies that differ by environment. (5) Inspect GitHub's OIDC token claims using curl $ACTIONS_ID_TOKEN_REQUEST_URL to see exactly what's being sent.
15. How do you ensure zero-downtime deployments to AKS via Helm in CI/CD?
Multiple layers: (1) RollingUpdate strategy in the Deployment with maxUnavailable: 0, so new pods start before old ones terminate. (2) Proper readiness probes so traffic only routes to healthy pods. (3) preStop lifecycle hooks with a sleep to allow in-flight requests to complete before pod termination. (4) Pod Disruption Budgets to prevent too many pods going down simultaneously. (5) The --atomic flag in Helm so a failed upgrade auto-rolls back. (6) Connection draining configured on the ingress controller. All of these are configured in the Helm chart's templates and values.
Real-World Scenario
A fintech startup runs 12 microservices on AKS, all deployed via GitHub Actions and Helm. Here's how their pipeline evolved:
Phase 1: Manual deploys. Engineers ran helm upgrade from their laptops. Different developers had different kubeconfigs, sometimes deploying debug builds to production. Deployments were infrequent (weekly) because they were risky and time-consuming.
Phase 2: Basic CI/CD. A single workflow built and deployed all 12 services on every push to main. Build times ballooned to 45 minutes. A bug in one service blocked deployment of all others.
Phase 3: Optimized pipeline (current):
- Path-filtered triggers: each service only builds when its directory changes, reducing average pipeline time to 6 minutes
- Reusable workflow: a single deploy-service.yml accepts service name, namespace, and chart path as inputs. All 12 services call the same workflow
- Staging auto-deploy: every merge to main automatically deploys to the staging namespace. Smoke tests run against staging endpoints
- Production manual approval: the production environment requires approval from two senior engineers. Deployment windows are enforced via environment protection rules
- OIDC authentication: eliminated all stored Azure secrets. Federated credentials are scoped per environment
- Helm --atomic: three production incidents were automatically rolled back before users noticed, thanks to the atomic flag and proper readiness probes
Result: deployments went from weekly and risky to 20+ per day with zero downtime. Mean time to production dropped from 5 days to 15 minutes.
Summary
- Pipeline flow: Code Push → Lint & Test → Docker Build → ACR Push → Helm Deploy → AKS (staging → production)
- Azure auth: Use azure/login@v2 with OIDC (recommended) or Service Principal credentials stored as secrets
- ACR push: Tag images with ${{ github.sha }} for immutable traceability; use Docker layer caching (type=gha) for speed
- Helm deploy: helm upgrade --install with --atomic --wait --timeout 5m for safe, idempotent deployments
- Environments: Use GitHub Environments to separate staging (auto-deploy) and production (manual approval)
- Smoke tests: Always validate staging before promoting to production; curl health endpoints, check response codes
- OIDC over SP: Federated credentials eliminate secret rotation and reduce leak risk
- Debugging: Most failures trace to auth issues (expired SP, wrong credentials), image pull errors (ACR-AKS integration), or pod crashes (check logs and describe)