AKS in CI/CD
Build automated deployment pipelines for AKS using Azure DevOps, GitHub Actions, and GitOps with Flux v2 — covering build, push, deploy, rollback, and promotion strategies across environments.
🧒 Simple Explanation (ELI5)
CI/CD with AKS is like a factory assembly line:
- Raw materials go in (a developer pushes code to Git).
- Quality inspection (CI runs tests, linting, security scans).
- Assembly (build a Docker image and push it to ACR).
- Packaging (update the Helm chart with the new image tag).
- Delivery (deploy to AKS — first to staging, then to production after approval).
- Recall system (if something breaks, rollback to the previous version automatically).
The beauty is that no human touches the assembly line after the developer pushes code. Everything is automated, repeatable, and auditable. Push code → tested container deployed to production in minutes.
🔧 Technical Explanation
1. Azure DevOps Pipelines for AKS
Azure DevOps provides native AKS integration through service connections and built-in tasks.
# azure-pipelines.yml — Build + Deploy to AKS via Helm
trigger:
branches:
include:
- main
variables:
acrName: 'myacr'
acrLoginServer: 'myacr.azurecr.io'
imageRepository: 'myapp'
tag: '$(Build.BuildId)'
aksResourceGroup: 'myRG'
aksClusterName: 'myAKS'
helmChartPath: './charts/myapp'
namespace: 'production'
stages:
- stage: Build
displayName: 'Build & Push Image'
jobs:
- job: BuildImage
pool:
vmImage: 'ubuntu-latest'
steps:
- task: Docker@2
displayName: 'Build and Push to ACR'
inputs:
containerRegistry: 'acr-service-connection'
repository: '$(imageRepository)'
command: 'buildAndPush'
Dockerfile: '**/Dockerfile'
tags: |
$(tag)
latest
- stage: Deploy
displayName: 'Deploy to AKS'
dependsOn: Build
jobs:
- deployment: DeployHelm
environment: 'production'
pool:
vmImage: 'ubuntu-latest'
strategy:
runOnce:
deploy:
steps:
- task: HelmInstaller@0
displayName: 'Install Helm'
inputs:
helmVersion: '3.14.0'
- task: AzureCLI@2
displayName: 'Deploy via Helm'
inputs:
azureSubscription: 'azure-service-connection'
scriptType: 'bash'
scriptLocation: 'inlineScript'
inlineScript: |
az aks get-credentials -g $(aksResourceGroup) -n $(aksClusterName)
helm upgrade --install myapp $(helmChartPath) \
--namespace $(namespace) \
--create-namespace \
--set image.repository=$(acrLoginServer)/$(imageRepository) \
--set image.tag=$(tag) \
                --wait --timeout 5m

2. GitHub Actions for AKS
GitHub Actions uses OIDC federation with Azure (workload identity federation) — no secrets stored in GitHub.
# .github/workflows/deploy-aks.yml
name: Build and Deploy to AKS
on:
push:
branches: [main]
permissions:
id-token: write # Required for OIDC
contents: read
env:
ACR_NAME: myacr
ACR_LOGIN_SERVER: myacr.azurecr.io
IMAGE_NAME: myapp
AKS_RESOURCE_GROUP: myRG
AKS_CLUSTER_NAME: myAKS
HELM_CHART_PATH: ./charts/myapp
NAMESPACE: production
jobs:
build-and-push:
name: Build & Push to ACR
runs-on: ubuntu-latest
outputs:
image-tag: ${{ steps.meta.outputs.version }}
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Azure Login (OIDC)
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: ACR Login
run: az acr login --name ${{ env.ACR_NAME }}
- name: Docker meta
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }}
tags: |
type=sha,prefix=
type=ref,event=branch
- name: Build and Push
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
deploy:
name: Deploy to AKS
needs: build-and-push
runs-on: ubuntu-latest
environment: production # GitHub Environment for approvals
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Azure Login (OIDC)
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: Set AKS context
uses: azure/aks-set-context@v4
with:
resource-group: ${{ env.AKS_RESOURCE_GROUP }}
cluster-name: ${{ env.AKS_CLUSTER_NAME }}
- name: Setup Helm
uses: azure/setup-helm@v4
with:
version: '3.14.0'
- name: Helm Lint
run: helm lint ${{ env.HELM_CHART_PATH }}
- name: Helm Dry-Run
run: |
helm upgrade --install myapp ${{ env.HELM_CHART_PATH }} \
--namespace ${{ env.NAMESPACE }} \
--set image.repository=${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }} \
--set image.tag=${{ needs.build-and-push.outputs.image-tag }} \
--dry-run
- name: Deploy with Helm
run: |
helm upgrade --install myapp ${{ env.HELM_CHART_PATH }} \
--namespace ${{ env.NAMESPACE }} \
--create-namespace \
--set image.repository=${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }} \
--set image.tag=${{ needs.build-and-push.outputs.image-tag }} \
--wait --timeout 5m
- name: Run Helm Tests
run: helm test myapp --namespace ${{ env.NAMESPACE }}
- name: Verify Deployment
run: |
kubectl rollout status deployment/myapp -n ${{ env.NAMESPACE }} --timeout=3m
          kubectl get pods -n ${{ env.NAMESPACE }} -l app=myapp

Set up OIDC federation between GitHub and Azure: create an Azure AD app registration, add a federated credential for your GitHub repo/branch, and store the client ID, tenant ID, and subscription ID as GitHub secrets. This eliminates the need for long-lived client secrets — tokens are short-lived and scoped to each workflow run.
3. Deployment Flow
The end-to-end flow is the same on both platforms: developer pushes to main → CI builds and tests → image is pushed to ACR → the chart is validated (lint, dry-run) → deploy to staging → smoke tests → manual approval gate → deploy to production → verify the rollout, with rollback if verification fails.
4. GitOps with Flux v2
GitOps flips the deployment model: instead of a pipeline pushing to the cluster, Flux pulls desired state from Git and reconciles the cluster to match.
# Install the Flux v2 (GitOps) extension on AKS
az k8s-extension create \
  --resource-group myRG \
  --cluster-name myAKS \
  --cluster-type managedClusters \
  --name flux \
  --extension-type microsoft.flux

# Create a Flux configuration pointing to your Git repo
az k8s-configuration flux create \
  --resource-group myRG \
  --cluster-name myAKS \
  --cluster-type managedClusters \
  --name myapp-config \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/myorg/aks-gitops-config \
  --branch main \
  --kustomization name=infra path=./infrastructure prune=true \
  --kustomization name=apps path=./apps/production prune=true dependsOn=infra
# Example GitOps repo structure
# aks-gitops-config/
# ├── infrastructure/
# │ ├── namespaces.yaml
# │ ├── network-policies.yaml
# │ └── rbac.yaml
# └── apps/
# ├── production/
# │ ├── myapp-helmrelease.yaml
# │ └── kustomization.yaml
# └── staging/
# ├── myapp-helmrelease.yaml
# └── kustomization.yaml
---
# apps/production/myapp-helmrelease.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
name: myapp
namespace: production
spec:
interval: 5m
chart:
spec:
chart: myapp
version: "1.2.x" # Semver range — auto-upgrade patches
sourceRef:
kind: HelmRepository
name: myacr-charts
namespace: flux-system
values:
image:
repository: myacr.azurecr.io/myapp
tag: "abc1234" # Updated by CI pipeline or image automation
replicaCount: 3
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
upgrade:
remediation:
retries: 3
rollback:
    cleanupOnFail: true
---
# HelmRepository source for ACR-hosted charts
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: myacr-charts
  namespace: flux-system
spec:
  interval: 10m
  url: oci://myacr.azurecr.io/helm
  type: oci
With GitOps, no one runs kubectl apply directly in production. All changes go through Git (pull request → review → merge). Flux reconciles the cluster automatically. Manual changes are overwritten on the next reconciliation cycle. This provides a complete audit trail and easy rollback via git revert.
5. Blue-Green & Canary Deployments
Kubernetes and AKS provide several strategies beyond the default rolling update:
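For reference, the default strategy these approaches improve on — the rolling update — is configured on the Deployment itself. A minimal sketch (names, image, and probe path reuse this chapter's examples and are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 extra pod during the update
      maxUnavailable: 0    # never drop below the desired replica count
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myacr.azurecr.io/myapp:v1.8.3
          readinessProbe:          # gates the traffic shift to new pods
            httpGet:
              path: /health
              port: 80
```

Rolling updates shift all traffic as pods pass readiness — the strategies below add finer-grained control over who receives the new version and when.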
Canary with NGINX Ingress
# Main Ingress — routes to stable version
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: myapp-stable
annotations:
nginx.ingress.kubernetes.io/canary: "false"
spec:
ingressClassName: nginx
rules:
- host: myapp.company.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: myapp-stable
port:
number: 80
---
# Canary Ingress — routes 10% of traffic to new version
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: myapp-canary
annotations:
nginx.ingress.kubernetes.io/canary: "true"
nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
ingressClassName: nginx
rules:
- host: myapp.company.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: myapp-canary
port:
              number: 80

Blue-Green with Service Swap
# Deploy "green" alongside existing "blue"
helm install myapp-green ./charts/myapp \
--namespace production \
--set image.tag=v2.0 \
--set service.name=myapp-green
# Run smoke tests against the green service
kubectl run smoke-test --rm -it --image=curlimages/curl -- \
curl -s http://myapp-green.production.svc/health
# Swap the production Service selector to point to green pods
kubectl patch service myapp -n production \
-p '{"spec":{"selector":{"version":"green"}}}'
# If issues arise — swap back to blue
kubectl patch service myapp -n production \
-p '{"spec":{"selector":{"version":"blue"}}}'
# Clean up the old blue deployment after validation
helm uninstall myapp-blue -n production

6. Environment Promotion
Use the same Helm chart across environments — only change values:
# Same chart, different values per environment
helm upgrade --install myapp ./charts/myapp \
  -f values-dev.yaml \
  --namespace dev

helm upgrade --install myapp ./charts/myapp \
  -f values-staging.yaml \
  --namespace staging

helm upgrade --install myapp ./charts/myapp \
  -f values-production.yaml \
  --namespace production
# values-dev.yaml
replicaCount: 1
image:
tag: "latest"
resources:
requests:
cpu: 100m
memory: 128Mi
# values-production.yaml
replicaCount: 5
image:
tag: "v1.8.3" # Pinned, immutable tag
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: "1"
memory: "1Gi"
autoscaling:
enabled: true
minReplicas: 5
  maxReplicas: 20

7. Service Connections & Authentication
| Platform | Auth Method | Configuration |
|---|---|---|
| Azure DevOps | Service Connection (SPN or Managed Identity) | Project Settings → Service connections → Azure Resource Manager |
| GitHub Actions | OIDC Workload Identity Federation | Azure AD App Registration + Federated Credential for repo/branch |
| Self-hosted agents | Managed Identity on agent VM/VMSS | Assign identity → az role assignment → no secrets needed |
# Set up OIDC for GitHub Actions
# 1. Create an app registration
APP_ID=$(az ad app create --display-name "github-aks-deploy" --query appId -o tsv)
# 2. Create a service principal
az ad sp create --id "$APP_ID"
# 3. Add federated credential for the GitHub repo
az ad app federated-credential create --id "$APP_ID" --parameters '{
"name": "github-main-branch",
"issuer": "https://token.actions.githubusercontent.com",
"subject": "repo:myorg/myapp:ref:refs/heads/main",
"audiences": ["api://AzureADTokenExchange"]
}'
# 4. Grant AKS and ACR access
AKS_ID=$(az aks show -g myRG -n myAKS --query id -o tsv)
ACR_ID=$(az acr show -n myacr --query id -o tsv)
az role assignment create --assignee "$APP_ID" --role "Azure Kubernetes Service Cluster User Role" --scope "$AKS_ID"
az role assignment create --assignee "$APP_ID" --role "AcrPush" --scope "$ACR_ID"
# 5. Store in GitHub Secrets: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID

8. Rollback Strategies
# Helm rollback — revert to a previous release
helm history myapp -n production
helm rollback myapp 3 -n production   # Roll back to revision 3
helm rollback myapp 0 -n production   # Roll back to the previous release

# kubectl rollout undo — revert a Deployment
kubectl rollout undo deployment/myapp -n production
kubectl rollout undo deployment/myapp -n production --to-revision=5

# Check rollout status
kubectl rollout status deployment/myapp -n production

# Flux GitOps rollback — revert the Git commit
git revert HEAD       # Revert the last commit
git push origin main
# Flux detects the change and reconciles — rolls the cluster back automatically
Helm rollback reverts the Kubernetes manifests but does NOT revert database migrations, config changes in Key Vault, or external service state. Always design your deployments to be backward-compatible with the previous schema/config, or use separate migration pipelines with rollback support.
9. Pre-deploy Validation
# 1. Lint the chart
helm lint ./charts/myapp -f values-production.yaml

# 2. Template render — catch templating errors
helm template myapp ./charts/myapp -f values-production.yaml > /tmp/rendered.yaml

# 3. Validate against the Kubernetes schema (kubeconform)
helm template myapp ./charts/myapp -f values-production.yaml | \
  kubeconform -strict -kubernetes-version 1.28.0

# 4. Dry-run against the actual cluster (catches RBAC, quota, admission issues)
helm upgrade --install myapp ./charts/myapp \
  -f values-production.yaml \
  -n production \
  --dry-run

# 5. Policy check — server-side dry-run evaluates admission, including Gatekeeper constraints
helm template myapp ./charts/myapp -f values-production.yaml | \
  kubectl apply --dry-run=server -f -
⌨️ Hands-on
Lab: Complete GitHub Actions Workflow for AKS
This lab builds a full pipeline: checkout → build → push to ACR → deploy with Helm → verify.
# Prerequisites: Set up OIDC federation (see section 7 above)
# Then create the workflow file:
mkdir -p .github/workflows
# .github/workflows/aks-full-pipeline.yml
name: Full AKS CI/CD Pipeline
on:
push:
branches: [main]
pull_request:
branches: [main]
permissions:
id-token: write
contents: read
env:
ACR_NAME: myacr
ACR_LOGIN_SERVER: myacr.azurecr.io
IMAGE_NAME: myapp
AKS_RG: myRG
AKS_NAME: myAKS
jobs:
# ──── CI: Build, Test, Scan ────
ci:
name: CI - Build & Test
runs-on: ubuntu-latest
outputs:
image-tag: ${{ github.sha }}
steps:
- uses: actions/checkout@v4
- name: Run unit tests
run: |
npm ci
npm test
- name: Azure Login
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- name: ACR Login
run: az acr login --name ${{ env.ACR_NAME }}
- name: Build and Push Image
run: |
docker build -t ${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }}:${{ github.sha }} .
docker push ${{ env.ACR_LOGIN_SERVER }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
- name: Scan Image for Vulnerabilities
run: |
az acr repository show-tags -n ${{ env.ACR_NAME }} --repository ${{ env.IMAGE_NAME }} --top 1
echo "Image pushed. Defender for Containers will scan automatically."
# ──── Validate: Lint & Dry-Run ────
validate:
name: Validate Helm Chart
needs: ci
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Helm
uses: azure/setup-helm@v4
- name: Helm Lint
run: helm lint ./charts/myapp -f ./charts/myapp/values-production.yaml
- name: Template & Schema Validate
run: |
helm template myapp ./charts/myapp \
-f ./charts/myapp/values-production.yaml \
--set image.tag=${{ needs.ci.outputs.image-tag }} \
> /tmp/rendered.yaml
# Install kubeconform
curl -sSLo /tmp/kubeconform.tar.gz \
https://github.com/yannh/kubeconform/releases/latest/download/kubeconform-linux-amd64.tar.gz
          sudo tar xf /tmp/kubeconform.tar.gz -C /usr/local/bin
kubeconform -strict -kubernetes-version 1.28.0 /tmp/rendered.yaml
# ──── Deploy to Staging ────
deploy-staging:
name: Deploy to Staging
needs: [ci, validate]
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v4
- name: Azure Login
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- uses: azure/aks-set-context@v4
with:
resource-group: ${{ env.AKS_RG }}
cluster-name: ${{ env.AKS_NAME }}
- uses: azure/setup-helm@v4
- name: Deploy to staging
run: |
helm upgrade --install myapp ./charts/myapp \
-f ./charts/myapp/values-staging.yaml \
--namespace staging \
--create-namespace \
--set image.tag=${{ needs.ci.outputs.image-tag }} \
--wait --timeout 5m
- name: Smoke test
run: |
kubectl rollout status deployment/myapp -n staging --timeout=3m
STAGING_IP=$(kubectl get svc myapp -n staging -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -sf "http://$STAGING_IP/health" || exit 1
# ──── Deploy to Production (with approval) ────
deploy-production:
name: Deploy to Production
needs: [ci, deploy-staging]
runs-on: ubuntu-latest
environment: production # Requires manual approval in GitHub
steps:
- uses: actions/checkout@v4
- name: Azure Login
uses: azure/login@v2
with:
client-id: ${{ secrets.AZURE_CLIENT_ID }}
tenant-id: ${{ secrets.AZURE_TENANT_ID }}
subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
- uses: azure/aks-set-context@v4
with:
resource-group: ${{ env.AKS_RG }}
cluster-name: ${{ env.AKS_NAME }}
- uses: azure/setup-helm@v4
- name: Deploy to production
run: |
helm upgrade --install myapp ./charts/myapp \
-f ./charts/myapp/values-production.yaml \
--namespace production \
--create-namespace \
--set image.tag=${{ needs.ci.outputs.image-tag }} \
--wait --timeout 10m
- name: Run Helm Tests
run: helm test myapp --namespace production
- name: Verify rollout
run: |
kubectl rollout status deployment/myapp -n production --timeout=5m
kubectl get pods -n production -l app=myapp
          echo "✅ Deployment successful: ${{ needs.ci.outputs.image-tag }}"

Configure the production environment in GitHub Settings → Environments with required reviewers. The pipeline pauses at the deploy-production job until an approved team member clicks "Approve." This provides a manual gate without breaking automation.
🐛 Debugging Scenarios
Scenario 1: "Pipeline can't connect to AKS"
Symptom: The GitHub Actions workflow fails at azure/aks-set-context with "Unable to connect to the server" or a 401 Unauthorized error.
# Step 1: Verify the OIDC credentials are correct
# Check GitHub Secrets: AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID
# They must match the app registration

# Step 2: Check the federated credential configuration
az ad app federated-credential list --id "<APP_ID>" -o table
# Verify: issuer  = "https://token.actions.githubusercontent.com"
#         subject = "repo:myorg/myapp:ref:refs/heads/main" (must match your repo/branch)

# Step 3: Check RBAC on the AKS resource
az role assignment list --assignee "<APP_ID>" --scope "/subscriptions/.../managedClusters/myAKS" -o table
# Must have "Azure Kubernetes Service Cluster User Role" or "Contributor"

# Step 4: Private cluster? GitHub-hosted runners can't reach private API endpoints
az aks show -g myRG -n myAKS --query "apiServerAccessProfile"
# If enablePrivateCluster=true → use self-hosted runners on a VM in the VNet,
# or use "command invoke":
az aks command invoke -g myRG -n myAKS --command "kubectl get nodes"

# Step 5: Check whether Azure AD integration requires the app to be in an admin group
az aks show -g myRG -n myAKS --query "aadProfile.adminGroupObjectIds"

# Fix: Correct the federated credential subject, add the RBAC role assignment,
# or switch to self-hosted runners for private clusters.
Scenario 2: "Helm deploy timed out in pipeline"
Symptom: The helm upgrade --install --wait command times out after 5 minutes. The pipeline fails but the cluster has partial resources deployed.
# Step 1: Check what Helm actually deployed
helm status myapp -n production
helm get manifest myapp -n production | head -50

# Step 2: Check pod status
kubectl get pods -n production -l app=myapp
# Look for: ImagePullBackOff, CrashLoopBackOff, Pending

# Step 3: If ImagePullBackOff → the image tag doesn't exist in ACR
az acr repository show-tags -n myacr --repository myapp --top 5
kubectl describe pod <POD_NAME> -n production | grep -A5 "Events"

# Step 4: If CrashLoopBackOff → check the app logs
kubectl logs deployment/myapp -n production --previous

# Step 5: If Pending → not enough resources or node affinity not met
kubectl describe pod <POD_NAME> -n production | grep -A10 "Events"
# Look for "FailedScheduling: Insufficient cpu/memory"

# Step 6: If health probes fail → the new version starts but the readiness probe never passes
kubectl describe pod <POD_NAME> -n production | grep -A10 "Conditions"
# Check the readiness probe config in the Helm values

# Step 7: Roll back the failed release
helm rollback myapp 0 -n production

# Fix: Correct the image tag, fix resource requests/limits, check the probes,
# and verify resource quotas in the namespace.
Scenario 3: "Flux not reconciling — cluster state is stale"
Symptom: You merged a change to the GitOps repo 30 minutes ago, but the cluster still shows the old version. Flux doesn't seem to be syncing.
# Step 1: Check that the Flux controllers are running
kubectl get pods -n flux-system
# Should see: source-controller, kustomize-controller, helm-controller, notification-controller

# Step 2: Check the GitRepository source status
kubectl get gitrepository -n flux-system
kubectl describe gitrepository myapp-config -n flux-system
# Look for: "Ready: False" and the reason (auth failure, branch not found, timeout)

# Step 3: If auth failure — check the SSH key or token
kubectl get secret flux-system -n flux-system
# Verify the deploy key is added to the Git repo

# Step 4: Check Kustomization status
kubectl get kustomization -n flux-system
kubectl describe kustomization apps -n flux-system
# Look for: "Reconciliation failed: ..." — might be a YAML syntax error

# Step 5: Check HelmRelease status
kubectl get helmrelease -A
kubectl describe helmrelease myapp -n production
# Look for: "upgrade retries exhausted" or "install failed"
# Common causes: chart version not found, values schema validation failure

# Step 6: Force reconciliation
flux reconcile source git myapp-config
flux reconcile kustomization apps
flux reconcile helmrelease myapp -n production

# Step 7: Check the Flux logs
kubectl logs deployment/source-controller -n flux-system --tail=50
kubectl logs deployment/helm-controller -n flux-system --tail=50

# Fix: Correct the Git auth, fix the YAML syntax, verify the chart version exists,
# and check that the HelmRelease values match the chart's values schema.
🎯 Interview Questions
Beginner
Q: What is CI/CD, and why is it critical for AKS?
CI (Continuous Integration) automatically builds and tests code on every commit. CD (Continuous Delivery/Deployment) automatically deploys the tested artifact to environments. For AKS, CI/CD is critical because containerized deployments involve multiple steps (build image, push to registry, update manifests, deploy to cluster) that are error-prone when done manually. Automation ensures consistency, speed, auditability, and easy rollback.
Q: How do Azure DevOps and GitHub Actions compare for AKS deployments?
Both are CI/CD platforms that can deploy to AKS. Azure DevOps offers native Azure integration with service connections, environment approvals, and built-in AKS/Helm tasks — preferred in enterprises already in the Azure ecosystem. GitHub Actions offers OIDC federation with Azure, marketplace actions, and tight integration with GitHub repos — preferred for open-source or GitHub-centric teams. The deployment commands are nearly identical; the choice is usually organizational preference.
Q: What role does ACR play in an AKS CI/CD pipeline?
ACR (Azure Container Registry) serves as the artifact repository in the pipeline. CI builds a Docker image and pushes it to ACR. CD pulls the image from ACR to AKS. ACR integrates with AKS via managed identity (no image pull secrets needed), supports geo-replication for multi-region deployments, content trust for signed images, and vulnerability scanning via Defender for Containers. It's the bridge between the build and deploy stages.
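The managed-identity integration mentioned above is enabled with a single CLI call — a sketch using this chapter's example resource names:

```shell
# Grant the cluster's kubelet identity AcrPull on the registry —
# afterwards, pods can pull from ACR without imagePullSecrets
az aks update -g myRG -n myAKS --attach-acr myacr
```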
Q: What does helm rollback do, and when would you use it?
helm rollback <release> <revision> reverts a Helm release to a previous revision. Helm stores the history of each release, including the manifests and values used. You use it when a new deployment causes issues — crashing pods, broken functionality, performance degradation. Rollback restores the Kubernetes resources to the previous state. Limitation: it doesn't revert external state like database migrations or config store changes.
Q: What is OIDC federation between GitHub Actions and Azure, and why is it preferred?
OIDC (OpenID Connect) federation allows GitHub Actions to authenticate to Azure without storing long-lived secrets. Instead, GitHub generates a short-lived OIDC token for each workflow run, which Azure AD exchanges for an Azure access token based on a federated credential trust. Benefits: no client secrets to rotate, tokens are scoped to specific repos and branches, automatic expiry, and better security posture. It's the recommended auth method for GitHub Actions to Azure.
Intermediate
Q: How does GitOps with Flux differ from traditional push-based CI/CD?
In traditional push-based CI/CD, the pipeline has cluster credentials and pushes changes (kubectl apply, helm upgrade). In GitOps with Flux, the cluster pulls its desired state from a Git repository. Flux controllers running inside the cluster continuously reconcile: they watch the Git repo for changes, and when a commit is detected, they apply the manifests. Benefits: Git is the single source of truth, full audit trail via Git history, easy rollback via git revert, no external tool needs cluster credentials, and drift detection (manual changes are auto-corrected).
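Under the hood, the az k8s-configuration command from section 4 creates Flux custom resources along these lines — a sketch; intervals and names follow the earlier example and are illustrative:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: myapp-config
  namespace: flux-system
spec:
  interval: 1m                 # how often Flux polls Git for new commits
  url: https://github.com/myorg/aks-gitops-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m                 # reconcile loop — also reverts manual drift
  path: ./apps/production
  prune: true                  # delete resources that were removed from Git
  sourceRef:
    kind: GitRepository
    name: myapp-config
  dependsOn:
    - name: infra              # apply infrastructure manifests first
```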
Q: What are the options for canary deployments on AKS?
Several approaches: (1) NGINX Ingress annotations: deploy the canary as a separate Deployment + Service, create a canary Ingress with nginx.ingress.kubernetes.io/canary: "true" and canary-weight: "10" to route 10% of traffic. Gradually increase the weight. (2) Flagger (CNCF project): automates canary analysis — deploys the canary, monitors metrics (success rate, latency), and auto-promotes or rolls back. (3) Azure Application Gateway: use weighted backend pools. (4) Service mesh (Istio/Linkerd): VirtualService traffic splitting. NGINX Ingress annotations are the simplest for AKS.
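A minimal Flagger Canary resource for the NGINX Ingress approach might be sketched like this — names reuse this chapter's examples, and the thresholds are illustrative:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  provider: nginx
  targetRef:                       # the Deployment Flagger manages
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  ingressRef:                      # the Ingress whose canary weight Flagger adjusts
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: myapp-stable
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5                   # failed checks before automatic rollback
    maxWeight: 50
    stepWeight: 10                 # shift 10% more traffic per interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99                  # roll back if success rate drops below 99%
        interval: 1m
```

Flagger creates and manages the canary Deployment, Service, and Ingress weights itself, so the manual two-Ingress setup shown in section 5 is replaced by this single resource.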
Q: How do you promote the same release across staging and production with Helm?
Create environment-specific values files: values-staging.yaml and values-production.yaml. The chart is the same but values differ (replicas, resource limits, feature flags, ingress hosts). In the pipeline: (1) Deploy to staging with helm upgrade -f values-staging.yaml. (2) Run integration/smoke tests. (3) After the approval gate, deploy to production with helm upgrade -f values-production.yaml using the same image tag. The image is built once, stored in ACR, and reused — never rebuilt for different environments.
Q: How do you validate a Helm chart before it reaches production?
Five validation layers: (1) helm lint — catches chart structure and template errors. (2) helm template | kubeconform — validates rendered YAML against the Kubernetes schema for the target K8s version. (3) helm upgrade --dry-run against the actual cluster — catches RBAC issues, resource quotas, and admission webhook rejections. (4) kubectl apply --dry-run=server — server-side validation including OPA/Gatekeeper policies. (5) Custom checks: verify the image exists in ACR, check for breaking schema changes, run conftest policies.
Q: How do you handle database migrations in an AKS deployment pipeline?
Use Helm pre-upgrade hooks with a Job that runs migration scripts before the main deployment. The hook Job runs to completion (or fails and blocks the deploy). Key considerations: (1) Migrations must be forward-compatible — old code should still work with the new schema during rolling updates. (2) Use a migration tool that supports idempotent scripts (e.g., Flyway, Liquibase). (3) Keep rollback migrations ready in case of issues. (4) For zero-downtime: use the expand-contract pattern — add new columns first (expand), deploy new code, then remove old columns later (contract).
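A pre-upgrade hook Job as described above could be sketched like this chart template — the migration image entrypoint and command are placeholders, not a specific tool's CLI:

```yaml
# templates/migration-job.yaml — runs before install/upgrade; failure blocks the deploy
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-db-migrate
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 1              # fail fast — a failed hook aborts the release
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["./migrate", "up"]   # placeholder — e.g. a Flyway/Liquibase wrapper
```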
Scenario-Based
Q: Your AKS cluster is private — how do you deploy to it from GitHub Actions?
Three options: (1) Self-hosted runners: deploy a GitHub Actions runner on a VM (or AKS itself) within the cluster's VNet — it can reach the private API endpoint. (2) az aks command invoke: use the Azure CLI command invoke feature from a GitHub-hosted runner: az aks command invoke -g myRG -n myAKS --command "helm upgrade ..." — this tunnels through ARM without direct API access. (3) Azure DevOps with private endpoint agents: use VMSS agent pools connected to the VNet. Option 1 (self-hosted runner) is most common for production. Option 2 is good for quick fixes but has limitations on file context.
Q: Someone makes a manual kubectl change in a Flux-managed production cluster. What happens, and how do you prevent it?
Flux detects the drift on its next reconciliation cycle (default: 5 minutes) and overwrites the manual change with the desired state from Git. The cluster self-heals. To prevent manual changes: (1) Restrict kubectl write access in production — only give developers the view ClusterRole via RBAC. (2) The Flux service account should be the only entity with write access to production namespaces. (3) Enable Azure AD + RBAC with minimal permissions. (4) Use Azure Policy to audit/deny changes not from the Flux service account. (5) Educate the team: "If it's not in Git, it doesn't exist."
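Point (1) can be sketched as a RoleBinding to the built-in view ClusterRole — the group name below is a placeholder Azure AD group object ID:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-view
  namespace: production
subjects:
  - kind: Group
    name: "a1b2c3d4-0000-0000-0000-000000000000"   # Azure AD group object ID (placeholder)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                  # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```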
Q: Your canary version is throwing errors in production. Walk through your response.
1. Immediately reduce canary traffic to 0%: update the NGINX Ingress annotation canary-weight: "0" or delete the canary Ingress entirely. 2. Check canary pod logs: kubectl logs -l version=canary -n production. 3. Check Application Insights for the canary's error details — which endpoint, what exception? 4. Compare canary and stable resource usage: kubectl top pods -l version=canary. 5. If it's a code bug: fix in development, run through CI, push a new canary. 6. If it's config/env: check values differences between canary and stable. 7. Post-mortem: document why the canary failed and what monitoring caught it — validate that the canary strategy is working as designed (it prevented a full production outage).
Q: How would you design a CI/CD pipeline for an app deployed to 3 AKS clusters in 3 regions?
Design: (1) Single CI stage: build the image once, push to ACR with geo-replication enabled (the image auto-replicates to all regions). (2) Parallel CD stages: deploy to all 3 clusters simultaneously using the same chart and image tag, but region-specific values (values-eastus.yaml, values-westeu.yaml, values-seasia.yaml) for region-specific Ingress hosts, resource counts, etc. (3) Use GitHub Environments with a matrix strategy: strategy: matrix: region: [eastus, westeu, seasia]. (4) Approval gates per region if needed. (5) For GitOps: use Flux with cluster-specific Kustomization overlays — same base manifests, region-specific patches.
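The matrix fan-out in point (3) could be sketched like this workflow fragment — the per-region cluster names are assumptions, and the Azure login step is omitted for brevity:

```yaml
deploy:
  needs: ci
  strategy:
    matrix:
      include:
        - region: eastus
          cluster: myAKS-eastus
        - region: westeu
          cluster: myAKS-westeu
        - region: seasia
          cluster: myAKS-seasia
  runs-on: ubuntu-latest
  environment: production-${{ matrix.region }}   # per-region approval gate
  steps:
    - uses: actions/checkout@v4
    - uses: azure/aks-set-context@v4
      with:
        resource-group: myRG
        cluster-name: ${{ matrix.cluster }}
    - name: Deploy region
      run: |
        helm upgrade --install myapp ./charts/myapp \
          -f ./charts/myapp/values-${{ matrix.region }}.yaml \
          --namespace production \
          --wait
```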
Q: A Flux HelmRelease is stuck and not reconciling. How do you recover?
1. kubectl describe helmrelease myapp -n production — read the failure reason (chart not found, values validation, install error). 2. Check the Helm release status: helm list -n production — it might show "failed" or "pending-upgrade". 3. If the Helm release is in a bad state: helm rollback myapp 0 -n production to restore a known-good state. 4. Fix the root cause in the GitOps repo (correct chart version, fix values, update image tag). 5. kubectl annotate helmrelease myapp -n production reconcile.fluxcd.io/requestedAt="$(date +%s)" to force reconciliation. 6. If still stuck, suspend and resume: flux suspend helmrelease myapp -n production then flux resume helmrelease myapp -n production. 7. In extreme cases, delete the HelmRelease CR and let Flux recreate it from Git.
🌍 Real-World Use Case
Full GitOps Pipeline at a Fintech Company
A fintech company processing 50,000 transactions/hour runs its platform on AKS across 3 Azure regions. They moved from manual deployments (averaging 2 incidents per release) to full GitOps in 4 months.
- Two Git repositories: (1) Application repos (code + Dockerfile + Helm chart per microservice). (2) GitOps config repo (Flux Kustomizations, HelmReleases, infrastructure manifests).
- CI (GitHub Actions): On PR merge → build image → push to ACR (geo-replicated) → run kubeconform + conftest policies → update image tag in the GitOps repo via automated PR.
- CD (Flux v2): Flux watches the GitOps repo. When the image tag PR is merged, Flux reconciles: staging first (5-minute interval), production after a 30-minute soak in staging (time-gated Kustomization).
- Canary with Flagger: Flagger automates canary analysis using Prometheus metrics. New versions receive 5% → 25% → 50% → 100% of traffic over 30 minutes. If the error rate exceeds 1% or p99 latency exceeds 500ms, Flagger rolls back automatically.
- Rollback: git revert in the GitOps repo triggers Flux to reconcile back. Average rollback time: 3 minutes (vs 25 minutes before GitOps).
- Guardrails: Conftest policies validate every HelmRelease before merge — must have resource limits, health probes, a PDB, and a non-root security context. Azure Policy on the clusters blocks anything that slips through.
- Alerting: Flux notification controller sends deployment events to Slack and PagerDuty. Failed reconciliations trigger Sev-2 alerts to the platform team.
Result: deployment frequency increased from weekly to 15+ deploys/day. Incidents per release dropped from 2 to 0.05 (1 incident per 20 releases). Mean time to recover reduced from 25 minutes to 3 minutes. Full audit trail satisfies PCI-DSS requirement 6.5.3 (change control).
📝 Summary
- Azure DevOps and GitHub Actions both provide first-class AKS integration — choose based on organizational preference.
- OIDC federation is the recommended auth method for GitHub Actions → Azure (no long-lived secrets).
- The standard pipeline flow: build → push to ACR → lint/validate → deploy with Helm → verify.
- GitOps with Flux v2 inverts the model — cluster pulls desired state from Git, providing audit trail, drift detection, and easy rollback.
- Canary and blue-green strategies minimize risk — use NGINX Ingress annotations, Flagger, or service mesh for traffic splitting.
- Environment promotion uses the same Helm chart with different values files — one image, multiple environments.
- Pre-deploy validation (lint, template, kubeconform, dry-run, policy check) catches issues before they reach the cluster.
- Rollback strategies: helm rollback for push-based, git revert for GitOps, kubectl rollout undo for quick fixes.
- Always use approval gates for production deployments — automated doesn't mean uncontrolled.
- Design for backward compatibility — database migrations, config changes, and API contracts must support rollback.