Advanced Lesson 10 of 14

AKS in CI/CD

Build automated deployment pipelines for AKS using Azure DevOps, GitHub Actions, and GitOps with Flux v2 — covering build, push, deploy, rollback, and promotion strategies across environments.

🧒 Simple Explanation (ELI5)

CI/CD with AKS is like a factory assembly line: the developer pushes code (raw materials arrive), CI builds and tests the container image (assembly and quality control), ACR stores the finished image (the warehouse), and CD ships it to the cluster (delivery to the store).

The beauty is that no human touches the assembly line after the developer pushes code. Everything is automated, repeatable, and auditable. Push code → tested container deployed to production in minutes.

🔧 Technical Explanation

1. Azure DevOps Pipelines for AKS

Azure DevOps provides native AKS integration through service connections and built-in tasks.

yaml
# azure-pipelines.yml — Build + Deploy to AKS via Helm
trigger:
  branches:
    include:
    - main

variables:
  acrName: 'myacr'
  acrLoginServer: 'myacr.azurecr.io'
  imageRepository: 'myapp'
  tag: '$(Build.BuildId)'
  aksResourceGroup: 'myRG'
  aksClusterName: 'myAKS'
  helmChartPath: './charts/myapp'
  namespace: 'production'

stages:
- stage: Build
  displayName: 'Build & Push Image'
  jobs:
  - job: BuildImage
    pool:
      vmImage: 'ubuntu-latest'
    steps:
    - task: Docker@2
      displayName: 'Build and Push to ACR'
      inputs:
        containerRegistry: 'acr-service-connection'
        repository: '$(imageRepository)'
        command: 'buildAndPush'
        Dockerfile: '**/Dockerfile'
        tags: |
          $(tag)
          latest

- stage: Deploy
  displayName: 'Deploy to AKS'
  dependsOn: Build
  jobs:
  - deployment: DeployHelm
    environment: 'production'
    pool:
      vmImage: 'ubuntu-latest'
    strategy:
      runOnce:
        deploy:
          steps:
          - task: HelmInstaller@0
            displayName: 'Install Helm'
            inputs:
              helmVersion: '3.14.0'
          - task: AzureCLI@2
            displayName: 'Deploy via Helm'
            inputs:
              azureSubscription: 'azure-service-connection'
              scriptType: 'bash'
              scriptLocation: 'inlineScript'
              inlineScript: |
                az aks get-credentials -g $(aksResourceGroup) -n $(aksClusterName)
                helm upgrade --install myapp $(helmChartPath) \
                  --namespace $(namespace) \
                  --create-namespace \
                  --set image.repository=$(acrLoginServer)/$(imageRepository) \
                  --set image.tag=$(tag) \
                  --wait --timeout 5m

2. GitHub Actions for AKS

GitHub Actions can use OIDC federation with Azure (workload identity federation) — no long-lived secrets are stored in GitHub, only the non-sensitive client, tenant, and subscription IDs.

yaml
# .github/workflows/deploy-aks.yml
name: Build and Deploy to AKS

on:
  push:
    branches: [main]

permissions:
  id-token: write    # Required for OIDC
  contents: read

env:
  ACR_NAME: myacr
  ACR_LOGIN_SERVER: myacr.azurecr.io
  IMAGE_NAME: myapp
  AKS_RESOURCE_GROUP: myRG
  AKS_CLUSTER_NAME: myAKS
  HELM_CHART_PATH: ./charts/myapp
  NAMESPACE: production

jobs:
  build-and-push:
    name: Build & Push to ACR
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.version }}
    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Azure Login (OIDC)
      uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    - name: ACR Login
      run: az acr login --name ${{ env.ACR_NAME }}

    - name: Docker meta
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }}
        tags: |
          type=sha,prefix=,priority=900
          type=ref,event=branch

    - name: Build and Push
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        labels: ${{ steps.meta.outputs.labels }}

  deploy:
    name: Deploy to AKS
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: production   # GitHub Environment for approvals
    steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Azure Login (OIDC)
      uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    - name: Set AKS context
      uses: azure/aks-set-context@v4
      with:
        resource-group: ${{ env.AKS_RESOURCE_GROUP }}
        cluster-name: ${{ env.AKS_CLUSTER_NAME }}

    - name: Setup Helm
      uses: azure/setup-helm@v4
      with:
        version: '3.14.0'

    - name: Helm Lint
      run: helm lint ${{ env.HELM_CHART_PATH }}

    - name: Helm Dry-Run
      run: |
        helm upgrade --install myapp ${{ env.HELM_CHART_PATH }} \
          --namespace ${{ env.NAMESPACE }} \
          --set image.repository=${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }} \
          --set image.tag=${{ needs.build-and-push.outputs.image-tag }} \
          --dry-run

    - name: Deploy with Helm
      run: |
        helm upgrade --install myapp ${{ env.HELM_CHART_PATH }} \
          --namespace ${{ env.NAMESPACE }} \
          --create-namespace \
          --set image.repository=${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }} \
          --set image.tag=${{ needs.build-and-push.outputs.image-tag }} \
          --wait --timeout 5m

    - name: Run Helm Tests
      run: helm test myapp --namespace ${{ env.NAMESPACE }}

    - name: Verify Deployment
      run: |
        kubectl rollout status deployment/myapp -n ${{ env.NAMESPACE }} --timeout=3m
        kubectl get pods -n ${{ env.NAMESPACE }} -l app=myapp
💡
OIDC Authentication

Set up OIDC federation between GitHub and Azure: create an Azure AD app registration, add a federated credential for your GitHub repo/branch, and store the client ID, tenant ID, and subscription ID as GitHub secrets. This eliminates the need for long-lived client secrets — tokens are short-lived and scoped to each workflow run.

3. Deployment Flow

CI/CD Pipeline Flow:
Git Push → CI: Build + Test → Push to ACR → Approval Gate → Helm Deploy to AKS → Helm Test + Verify

4. GitOps with Flux v2

GitOps flips the deployment model: instead of a pipeline pushing to the cluster, Flux pulls desired state from Git and reconciles the cluster to match.

bash
# Enable GitOps (Flux v2) on AKS by installing the Flux cluster extension
az k8s-extension create -g myRG -c myAKS \
  --cluster-type managedClusters \
  --name flux --extension-type microsoft.flux

# Create a Flux configuration pointing to your Git repo
az k8s-configuration flux create \
  --resource-group myRG \
  --cluster-name myAKS \
  --cluster-type managedClusters \
  --name myapp-config \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/myorg/aks-gitops-config \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true \
  --kustomization name=apps path=./apps/production prune=true dependsOn=infra
yaml
# Example GitOps repo structure
# aks-gitops-config/
# ├── infrastructure/
# │   ├── namespaces.yaml
# │   ├── network-policies.yaml
# │   └── rbac.yaml
# └── apps/
#     ├── production/
#     │   ├── myapp-helmrelease.yaml
#     │   └── kustomization.yaml
#     └── staging/
#         ├── myapp-helmrelease.yaml
#         └── kustomization.yaml

---
# apps/production/myapp-helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: myapp
  namespace: production
spec:
  interval: 5m
  chart:
    spec:
      chart: myapp
      version: "1.2.x"         # Semver range — auto-upgrade patches
      sourceRef:
        kind: HelmRepository
        name: myacr-charts
        namespace: flux-system
  values:
    image:
      repository: myacr.azurecr.io/myapp
      tag: "abc1234"            # Updated by CI pipeline or image automation
    replicaCount: 3
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
  upgrade:
    remediation:
      retries: 3
  rollback:
    cleanupOnFail: true
yaml
# HelmRepository source for ACR-hosted charts
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: myacr-charts
  namespace: flux-system
spec:
  interval: 10m
  url: oci://myacr.azurecr.io/helm
  type: oci
GitOps Golden Rule

With GitOps, no one runs kubectl apply directly in production. All changes go through Git (pull request → review → merge). Flux reconciles the cluster automatically. Manual changes are overwritten on the next reconciliation cycle. This provides a complete audit trail and easy rollback via git revert.

5. Blue-Green & Canary Deployments

Kubernetes and AKS provide several strategies beyond the default rolling update:
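For contrast, the default rolling update is tuned directly on the Deployment. A minimal sketch (replica counts and image name are illustrative):

```yaml
# Default strategy: RollingUpdate, tuned via maxSurge/maxUnavailable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # At most 1 extra pod during the rollout
      maxUnavailable: 0     # Never drop below the desired replica count
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myacr.azurecr.io/myapp:v1.8.3
```

With `maxUnavailable: 0`, capacity never dips during an upgrade — but every pod is eventually replaced, which is exactly the all-or-nothing exposure that canary and blue-green strategies avoid.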

Canary with NGINX Ingress

yaml
# Main Ingress — routes to stable version
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-stable
  annotations:
    nginx.ingress.kubernetes.io/canary: "false"
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.company.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-stable
            port:
              number: 80
---
# Canary Ingress — routes 10% of traffic to new version
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  ingressClassName: nginx
  rules:
  - host: myapp.company.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80

Blue-Green with Service Swap

bash
# Deploy "green" alongside existing "blue"
helm install myapp-green ./charts/myapp \
  --namespace production \
  --set image.tag=v2.0 \
  --set service.name=myapp-green

# Run smoke tests against the green service
kubectl run smoke-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://myapp-green.production.svc/health

# Swap the production Service selector to point to green pods
kubectl patch service myapp -n production \
  -p '{"spec":{"selector":{"version":"green"}}}'

# If issues arise — swap back to blue
kubectl patch service myapp -n production \
  -p '{"spec":{"selector":{"version":"blue"}}}'

# Clean up the old blue deployment after validation
helm uninstall myapp-blue -n production

6. Environment Promotion

Use the same Helm chart across environments — only change values:

Environment Promotion Pipeline:
Dev AKS → Staging AKS → Production AKS
bash
# Same chart, different values per environment
helm upgrade --install myapp ./charts/myapp \
  -f values-dev.yaml \
  --namespace dev

helm upgrade --install myapp ./charts/myapp \
  -f values-staging.yaml \
  --namespace staging

helm upgrade --install myapp ./charts/myapp \
  -f values-production.yaml \
  --namespace production
yaml
# values-dev.yaml
replicaCount: 1
image:
  tag: "latest"
resources:
  requests:
    cpu: 100m
    memory: 128Mi

# values-production.yaml
replicaCount: 5
image:
  tag: "v1.8.3"            # Pinned, immutable tag
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "1"
    memory: "1Gi"
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 20

7. Service Connections & Authentication

| Platform | Auth Method | Configuration |
|---|---|---|
| Azure DevOps | Service Connection (SPN or Managed Identity) | Project Settings → Service connections → Azure Resource Manager |
| GitHub Actions | OIDC Workload Identity Federation | Azure AD App Registration + Federated Credential for repo/branch |
| Self-hosted agents | Managed Identity on agent VM/VMSS | Assign identity → az role assignment → no secrets needed |
bash
# Set up OIDC for GitHub Actions
# 1. Create an app registration
APP_ID=$(az ad app create --display-name "github-aks-deploy" --query appId -o tsv)

# 2. Create a service principal
az ad sp create --id "$APP_ID"

# 3. Add federated credential for the GitHub repo
az ad app federated-credential create --id "$APP_ID" --parameters '{
  "name": "github-main-branch",
  "issuer": "https://token.actions.githubusercontent.com",
  "subject": "repo:myorg/myapp:ref:refs/heads/main",
  "audiences": ["api://AzureADTokenExchange"]
}'

# 4. Grant AKS and ACR access
AKS_ID=$(az aks show -g myRG -n myAKS --query id -o tsv)
ACR_ID=$(az acr show -n myacr --query id -o tsv)
az role assignment create --assignee "$APP_ID" --role "Azure Kubernetes Service Cluster User Role" --scope "$AKS_ID"
az role assignment create --assignee "$APP_ID" --role "AcrPush" --scope "$ACR_ID"

# 5. Store in GitHub Secrets: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID

8. Rollback Strategies

bash
# Helm rollback — revert to previous release
helm history myapp -n production
helm rollback myapp 3 -n production    # Roll back to revision 3
helm rollback myapp 0 -n production    # Revision 0 (or omitted) rolls back to the previous release

# kubectl rollout undo — revert a Deployment
kubectl rollout undo deployment/myapp -n production
kubectl rollout undo deployment/myapp -n production --to-revision=5

# Check rollout status
kubectl rollout status deployment/myapp -n production

# Flux GitOps rollback — revert the Git commit
git revert HEAD    # Revert the last commit
git push origin main
# Flux detects the change and reconciles — rolls back the cluster automatically
⚠️
Helm Rollback Limitations

Helm rollback reverts the Kubernetes manifests but does NOT revert database migrations, config changes in Key Vault, or external service state. Always design your deployments to be backward-compatible with the previous schema/config, or use separate migration pipelines with rollback support.

9. Pre-deploy Validation

bash
# 1. Lint the chart
helm lint ./charts/myapp -f values-production.yaml

# 2. Template render — catch templating errors
helm template myapp ./charts/myapp -f values-production.yaml > /tmp/rendered.yaml

# 3. Validate against Kubernetes schema (kubeconform)
helm template myapp ./charts/myapp -f values-production.yaml | \
  kubeconform -strict -kubernetes-version 1.28.0

# 4. Dry-run against the actual cluster (catches RBAC, quota, admission issues)
helm upgrade --install myapp ./charts/myapp \
  -f values-production.yaml \
  -n production \
  --dry-run

# 5. Policy check — ensure the manifest passes all Gatekeeper constraints
helm template myapp ./charts/myapp -f values-production.yaml | \
  kubectl apply --dry-run=server -f -

⌨️ Hands-on

Lab: Complete GitHub Actions Workflow for AKS

This lab builds a full pipeline: checkout → build → push to ACR → deploy with Helm → verify.

bash
# Prerequisites: Set up OIDC federation (see section 7 above)
# Then create the workflow file:
mkdir -p .github/workflows
yaml
# .github/workflows/aks-full-pipeline.yml
name: Full AKS CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

permissions:
  id-token: write
  contents: read

env:
  ACR_NAME: myacr
  ACR_LOGIN_SERVER: myacr.azurecr.io
  IMAGE_NAME: myapp
  AKS_RG: myRG
  AKS_NAME: myAKS

jobs:
  # ──── CI: Build, Test, Scan ────
  ci:
    name: CI - Build & Test
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ github.sha }}
    steps:
    - uses: actions/checkout@v4

    - name: Run unit tests
      run: |
        npm ci
        npm test

    - name: Azure Login
      uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    - name: ACR Login
      run: az acr login --name ${{ env.ACR_NAME }}

    - name: Build and Push Image
      run: |
        docker build -t ${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }}:${{ github.sha }} .
        docker push ${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }}:${{ github.sha }}

    - name: Scan Image for Vulnerabilities
      run: |
        az acr repository show-tags -n ${{ env.ACR_NAME }} --repository ${{ env.IMAGE_NAME }} --top 1
        echo "Image pushed. Defender for Containers will scan automatically."

  # ──── Validate: Lint & Dry-Run ────
  validate:
    name: Validate Helm Chart
    needs: ci
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Setup Helm
      uses: azure/setup-helm@v4

    - name: Helm Lint
      run: helm lint ./charts/myapp -f ./charts/myapp/values-production.yaml

    - name: Template & Schema Validate
      run: |
        helm template myapp ./charts/myapp \
          -f ./charts/myapp/values-production.yaml \
          --set image.tag=${{ needs.ci.outputs.image-tag }} \
          > /tmp/rendered.yaml
        # Install kubeconform
        curl -sSLo /tmp/kubeconform.tar.gz \
          https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz
        sudo tar xf /tmp/kubeconform.tar.gz -C /usr/local/bin
        kubeconform -strict -kubernetes-version 1.28.0 /tmp/rendered.yaml

  # ──── Deploy to Staging ────
  deploy-staging:
    name: Deploy to Staging
    needs: [ci, validate]
    runs-on: ubuntu-latest
    environment: staging
    steps:
    - uses: actions/checkout@v4

    - name: Azure Login
      uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    - uses: azure/aks-set-context@v4
      with:
        resource-group: ${{ env.AKS_RG }}
        cluster-name: ${{ env.AKS_NAME }}

    - uses: azure/setup-helm@v4

    - name: Deploy to staging
      run: |
        helm upgrade --install myapp ./charts/myapp \
          -f ./charts/myapp/values-staging.yaml \
          --namespace staging \
          --create-namespace \
          --set image.tag=${{ needs.ci.outputs.image-tag }} \
          --wait --timeout 5m

    - name: Smoke test
      run: |
        kubectl rollout status deployment/myapp -n staging --timeout=3m
        STAGING_IP=$(kubectl get svc myapp -n staging -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
        curl -sf "http://$STAGING_IP/health" || exit 1

  # ──── Deploy to Production (with approval) ────
  deploy-production:
    name: Deploy to Production
    needs: [ci, deploy-staging]
    runs-on: ubuntu-latest
    environment: production    # Requires manual approval in GitHub
    steps:
    - uses: actions/checkout@v4

    - name: Azure Login
      uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

    - uses: azure/aks-set-context@v4
      with:
        resource-group: ${{ env.AKS_RG }}
        cluster-name: ${{ env.AKS_NAME }}

    - uses: azure/setup-helm@v4

    - name: Deploy to production
      run: |
        helm upgrade --install myapp ./charts/myapp \
          -f ./charts/myapp/values-production.yaml \
          --namespace production \
          --create-namespace \
          --set image.tag=${{ needs.ci.outputs.image-tag }} \
          --wait --timeout 10m

    - name: Run Helm Tests
      run: helm test myapp --namespace production

    - name: Verify rollout
      run: |
        kubectl rollout status deployment/myapp -n production --timeout=5m
        kubectl get pods -n production -l app=myapp
        echo "✅ Deployment successful: ${{ needs.ci.outputs.image-tag }}"
💡
GitHub Environments

Configure the production environment in GitHub Settings → Environments with required reviewers. The pipeline pauses at the deploy-production job until an approved team member clicks "Approve." This provides a manual gate without breaking automation.
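In the workflow YAML, the environment can also be declared as an object with a `url`, which surfaces a clickable deployment link in the GitHub UI — a small sketch (the hostname is illustrative):

```yaml
# Job-level environment: the job pauses here until a reviewer approves
deploy-production:
  runs-on: ubuntu-latest
  environment:
    name: production
    url: https://myapp.company.com   # Shown as the deployment URL in GitHub
  steps:
  - run: echo "deploying..."
```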

🐛 Debugging Scenarios

Scenario 1: "Pipeline can't connect to AKS"

Symptom: The GitHub Actions workflow fails at azure/aks-set-context with "Unable to connect to the server" or a 401 Unauthorized error.

bash
# Step 1: Verify OIDC credentials are correct
# Check GitHub Secrets: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID
# Must match the app registration

# Step 2: Check the federated credential configuration
az ad app federated-credential list --id "<APP_ID>" -o table
# Verify: issuer = "https://token.actions.githubusercontent.com"
# subject = "repo:myorg/myapp:ref:refs/heads/main" (match your repo/branch)

# Step 3: Check RBAC on the AKS resource
az role assignment list --assignee "<APP_ID>" --scope "/subscriptions/.../managedClusters/myAKS" -o table
# Must have "Azure Kubernetes Service Cluster User Role" or "Contributor"

# Step 4: Private cluster? GitHub-hosted runners can't reach private API endpoints
az aks show -g myRG -n myAKS --query "apiServerAccessProfile"
# If enablePrivateCluster=true → use self-hosted runners on a VM in the VNet
# Or use "command invoke": az aks command invoke -g myRG -n myAKS --command "kubectl get nodes"

# Step 5: Check if Azure AD integration requires the app to be in an admin group
az aks show -g myRG -n myAKS --query "aadProfile.adminGroupObjectIds"

# Fix: Correct the federated credential subject, add the RBAC role assignment,
# or switch to self-hosted runners for private clusters.

Scenario 2: "Helm deploy timed out in pipeline"

Symptom: The helm upgrade --install --wait command times out after 5 minutes. The pipeline fails but the cluster has partial resources deployed.

bash
# Step 1: Check what Helm actually deployed
helm status myapp -n production
helm get manifest myapp -n production | head -50

# Step 2: Check pod status
kubectl get pods -n production -l app=myapp
# Look for: ImagePullBackOff, CrashLoopBackOff, Pending

# Step 3: If ImagePullBackOff → image tag doesn't exist in ACR
az acr repository show-tags -n myacr --repository myapp --top 5
kubectl describe pod <POD_NAME> -n production | grep -A5 "Events"

# Step 4: If CrashLoopBackOff → check app logs
kubectl logs deployment/myapp -n production --previous

# Step 5: If Pending → not enough resources or node affinity not met
kubectl describe pod <POD_NAME> -n production | grep -A10 "Events"
# Look for "FailedScheduling: Insufficient cpu/memory"

# Step 6: If health probes fail → the new version starts but readiness probe fails
kubectl describe pod <POD_NAME> -n production | grep -A10 "Conditions"
# Check readiness probe config in the Helm values

# Step 7: Roll back the failed release
helm rollback myapp 0 -n production

# Fix: Correct the image tag, fix resource requests/limits, check probes,
# verify resource quotas in the namespace.

Scenario 3: "Flux not reconciling — cluster state is stale"

Symptom: You merged a change to the GitOps repo 30 minutes ago, but the cluster still shows the old version. Flux doesn't seem to be syncing.

bash
# Step 1: Check Flux controllers are running
kubectl get pods -n flux-system
# Should see: source-controller, kustomize-controller, helm-controller, notification-controller

# Step 2: Check the GitRepository source status
kubectl get gitrepository -n flux-system
kubectl describe gitrepository myapp-config -n flux-system
# Look for: "Ready: False" and the reason (auth failure, branch not found, timeout)

# Step 3: If auth failure — check the SSH key or token
kubectl get secret flux-system -n flux-system
# Verify the deploy key is added to the Git repo

# Step 4: Check Kustomization status
kubectl get kustomization -n flux-system
kubectl describe kustomization apps -n flux-system
# Look for: "Reconciliation failed: ..." — might be a YAML syntax error

# Step 5: Check HelmRelease status
kubectl get helmrelease -A
kubectl describe helmrelease myapp -n production
# Look for: "upgrade retries exhausted" or "install failed"
# Common: chart version not found, values schema validation failure

# Step 6: Force reconciliation
flux reconcile source git myapp-config
flux reconcile kustomization apps
flux reconcile helmrelease myapp -n production

# Step 7: Check Flux logs
kubectl logs deployment/source-controller -n flux-system --tail=50
kubectl logs deployment/helm-controller -n flux-system --tail=50

# Fix: Correct Git auth, fix YAML syntax, verify chart version exists,
# check HelmRelease values match the chart's values schema.

🎯 Interview Questions

Beginner

Q: What is CI/CD and why is it important for AKS deployments?

CI (Continuous Integration) automatically builds and tests code on every commit. CD (Continuous Delivery/Deployment) automatically deploys the tested artifact to environments. For AKS, CI/CD is critical because containerized deployments involve multiple steps (build image, push to registry, update manifests, deploy to cluster) that are error-prone when done manually. Automation ensures consistency, speed, auditability, and easy rollback.

Q: What is the difference between Azure DevOps Pipelines and GitHub Actions for AKS?

Both are CI/CD platforms that can deploy to AKS. Azure DevOps offers native Azure integration with service connections, environment approvals, and built-in AKS/Helm tasks — preferred in enterprises already in the Azure ecosystem. GitHub Actions offers OIDC federation with Azure, marketplace actions, and tight integration with GitHub repos — preferred for open-source or GitHub-centric teams. The deployment commands are nearly identical; the choice is usually organizational preference.

Q: What role does ACR play in an AKS CI/CD pipeline?

ACR (Azure Container Registry) serves as the artifact repository in the pipeline. CI builds a Docker image and pushes it to ACR. CD pulls the image from ACR to AKS. ACR integrates with AKS via managed identity (no image pull secrets needed), supports geo-replication for multi-region deployments, content trust for signed images, and vulnerability scanning via Defender for Containers. It's the bridge between the build and deploy stages.

Q: What is a Helm rollback and when would you use it?

helm rollback <release> <revision> reverts a Helm release to a previous revision. Helm stores the history of each release, including the manifests and values used. You use it when a new deployment causes issues — crashing pods, broken functionality, performance degradation. Rollback restores the Kubernetes resources to the previous state. Limitation: it doesn't revert external state like database migrations or config store changes.

Q: What is OIDC federation and why is it used in GitHub Actions for Azure?

OIDC (OpenID Connect) federation allows GitHub Actions to authenticate to Azure without storing long-lived secrets. Instead, GitHub generates a short-lived OIDC token for each workflow run, which Azure AD exchanges for an Azure access token based on a federated credential trust. Benefits: no client secrets to rotate, tokens are scoped to specific repos and branches, automatic expiry, and better security posture. It's the recommended auth method for GitHub Actions to Azure.

Intermediate

Q: What is GitOps with Flux v2 and how does it differ from traditional CI/CD push-based deployments?

In traditional push-based CI/CD, the pipeline has cluster credentials and pushes changes (kubectl apply, helm upgrade). In GitOps with Flux, the cluster pulls its desired state from a Git repository. Flux controllers running inside the cluster continuously reconcile: they watch the Git repo for changes, and when a commit is detected, they apply the manifests. Benefits: Git is the single source of truth, full audit trail via Git history, easy rollback via git revert, no external tool needs cluster credentials, and drift detection (manual changes are auto-corrected).

Q: How would you implement a canary deployment strategy on AKS?

Several approaches: (1) NGINX Ingress annotations: deploy the canary as a separate Deployment + Service, create a canary Ingress with nginx.ingress.kubernetes.io/canary: "true" and canary-weight: "10" to route 10% of traffic. Gradually increase weight. (2) Flagger (CNCF project): automates canary analysis — deploys canary, monitors metrics (success rate, latency), and auto-promotes or rolls back. (3) Azure Application Gateway: use weighted backend pools. (4) Service mesh (Istio/Linkerd): VirtualService traffic splitting. NGINX Ingress annotations are the simplest for AKS.

Q: How do you promote a release from staging to production using the same Helm chart?

Create environment-specific values files: values-staging.yaml and values-production.yaml. The chart is the same but values differ (replicas, resource limits, feature flags, ingress hosts). In the pipeline: (1) Deploy to staging with helm upgrade -f values-staging.yaml. (2) Run integration/smoke tests. (3) After approval gate, deploy to production with helm upgrade -f values-production.yaml using the same image tag. The image is built once, stored in ACR, and reused — never rebuilt for different environments.

Q: What pre-deploy validation steps should a CI/CD pipeline include for Helm-based AKS deployments?

Five validation layers: (1) helm lint — catches chart structure and template errors. (2) helm template | kubeconform — validates rendered YAML against Kubernetes schema for the target K8s version. (3) helm upgrade --dry-run against the actual cluster — catches RBAC issues, resource quotas, and admission webhook rejections. (4) kubectl apply --dry-run=server — server-side validation including OPA/Gatekeeper policies. (5) Custom checks: verify image exists in ACR, check for breaking schema changes, run conftest policies.

Q: How do you handle database migrations in a Helm-based CI/CD pipeline?

Use Helm pre-upgrade hooks with a Job that runs migration scripts before the main deployment. The hook Job runs to completion (or fails and blocks the deploy). Key considerations: (1) Migrations must be forward-compatible — old code should still work with the new schema during rolling updates. (2) Use a migration tool that supports idempotent scripts (e.g., Flyway, Liquibase). (3) Keep rollback migrations ready in case of issues. (4) For zero-downtime: use expand-contract pattern — add new columns first (expand), deploy new code, then remove old columns later (contract).
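A minimal sketch of such a pre-upgrade hook (the chart path, migration image, and `./migrate` entrypoint are assumptions):

```yaml
# templates/migration-job.yaml — runs before each helm install/upgrade
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-db-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 0            # Fail fast — a failed migration blocks the deploy
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: myacr.azurecr.io/myapp-migrations:{{ .Values.image.tag }}
        command: ["./migrate", "up"]   # Idempotent, forward-compatible scripts
```

Because the hook must succeed before Helm touches the Deployment, a broken migration never reaches running pods.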

Scenario-Based

Q: Your production deployment via GitHub Actions fails because the AKS cluster is private (no public API endpoint). How do you solve this?

Three options: (1) Self-hosted runners: deploy a GitHub Actions runner on a VM (or AKS itself) within the cluster's VNet — it can reach the private API endpoint. (2) az aks command invoke: use the Azure CLI command invoke feature from a GitHub-hosted runner: az aks command invoke -g myRG -n myAKS --command "helm upgrade ..." — this tunnels through ARM without direct API access. (3) Azure DevOps with private endpoint agents: use VMSS agent pools connected to the VNet. Option 1 (self-hosted runner) is most common for production. Option 2 is good for quick fixes but has limitations on file context.

Q: A developer accidentally ran kubectl apply directly in production, overwriting a Flux-managed resource. What happens and how do you prevent it?

Flux detects the drift on its next reconciliation cycle (default: 5 minutes) and overwrites the manual change with the desired state from Git. The cluster self-heals. To prevent manual changes: (1) Restrict kubectl write access in production — only give developers view ClusterRole via RBAC. (2) The Flux service account should be the only entity with write access to production namespaces. (3) Enable Azure AD + RBAC with minimal permissions. (4) Use Azure Policy to audit/deny changes not from the Flux service account. (5) Educate the team: "If it's not in Git, it doesn't exist."
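A sketch of point (1) — a read-only binding for developers in the production namespace (the Azure AD group object ID is a placeholder):

```yaml
# Bind the built-in "view" ClusterRole to developers in production
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-view
  namespace: production
subjects:
- kind: Group
  name: "<AAD-developers-group-object-id>"   # Azure AD group object ID
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                 # Read-only: no create/update/delete verbs
  apiGroup: rbac.authorization.k8s.io
```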

Q: Your canary deployment is receiving 10% of traffic. Monitoring shows the canary's error rate is 5x higher than stable. Walk through your response process.

1. Immediately reduce canary traffic to 0%: update the NGINX Ingress annotation canary-weight: "0" or delete the canary Ingress entirely. 2. Check canary pod logs: kubectl logs -l version=canary -n production. 3. Check Application Insights for the canary's error details — which endpoint, what exception? 4. Compare canary and stable resource usage: kubectl top pods -l version=canary. 5. If it's a code bug: fix in development, run through CI, push a new canary. 6. If it's config/env: check values differences between canary and stable. 7. Post-mortem: document why the canary failed and what monitoring caught it — validate that canary strategy is working as designed (it prevented a full production outage).

Q: You need to deploy the same application to 3 AKS clusters in different Azure regions. How do you design the CI/CD pipeline?

Design: (1) Single CI stage: build image once, push to ACR with geo-replication enabled (image auto-replicates to all regions). (2) Parallel CD stages: deploy to all 3 clusters simultaneously using the same chart and image tag, but region-specific values (values-eastus.yaml, values-westeu.yaml, values-seasia.yaml) for region-specific Ingress hosts, resource counts, etc. (3) Use GitHub Environments with matrix strategy: strategy: matrix: region: [eastus, westeu, seasia]. (4) Approval gates per region if needed. (5) For GitOps: use Flux with cluster-specific Kustomization overlays — same base manifests, region-specific patches.
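The matrix idea from point (3) could be sketched like this (environment names and values-file names are assumptions):

```yaml
# One job definition fans out into parallel per-region deploys
deploy:
  strategy:
    matrix:
      region: [eastus, westeu, seasia]
  runs-on: ubuntu-latest
  environment: production-${{ matrix.region }}   # Per-region approval gate
  steps:
  - uses: actions/checkout@v4
  - name: Deploy to ${{ matrix.region }}
    run: |
      helm upgrade --install myapp ./charts/myapp \
        -f values-${{ matrix.region }}.yaml \
        --namespace production
```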

Q: Your Flux HelmRelease shows "upgrade retries exhausted" and the cluster is stuck on an old version. How do you recover?

1. kubectl describe helmrelease myapp -n production — read the failure reason (chart not found, values validation, install error). 2. Check Helm release status: helm list -n production — it might show "failed" or "pending-upgrade". 3. If Helm release is in a bad state: helm rollback myapp 0 -n production to restore a known-good state. 4. Fix the root cause in the GitOps repo (correct chart version, fix values, update image tag). 5. kubectl annotate helmrelease myapp -n production reconcile.fluxcd.io/requestedAt="$(date +%s)" to force reconciliation. 6. If still stuck, suspend and resume: flux suspend helmrelease myapp -n production then flux resume helmrelease myapp -n production. 7. In extreme cases, delete the HelmRelease CR and let Flux recreate it from Git.

🌍 Real-World Use Case

Full GitOps Pipeline at a Fintech Company

A fintech company processing 50,000 transactions/hour runs its platform on AKS across 3 Azure regions. They moved from manual deployments (averaging 2 incidents per release) to full GitOps in 4 months.

Result: deployment frequency increased from weekly to 15+ deploys/day. Incidents per release dropped from 2 to 0.05 (1 incident per 20 releases). Mean time to recover reduced from 25 minutes to 3 minutes. Full audit trail satisfies PCI-DSS requirement 6.5.3 (change control).

📝 Summary

AKS deployments can be automated with Azure DevOps Pipelines or GitHub Actions (ideally via OIDC federation rather than stored secrets), or pulled from Git by Flux v2 in a GitOps model. Build the image once, store it in ACR, and promote the same tag through environments with per-environment Helm values. Validate before deploying (lint, template, kubeconform, dry-run), use canary or blue-green strategies to limit blast radius, and keep rollback paths ready — helm rollback, kubectl rollout undo, or git revert for GitOps.