Deploy to AKS
Build Docker images, push them to Azure Container Registry, and deploy to Azure Kubernetes Service using Helm: the full production pipeline.
Simple Explanation (ELI5)
Imagine you run a delivery company.
- Building the Docker image is like packing your product into a sturdy shipping box: everything the customer needs is sealed inside.
- Tagging the image is printing a shipping label with the exact version number so you always know which box is which.
- Pushing to ACR (Azure Container Registry) is dropping the box off at the warehouse. It's stored safely, ready to be picked up by any delivery truck.
- GitHub Actions is the delivery truck. Every time new code is merged, it automatically drives to the warehouse, picks up the latest box, and heads to the destination.
- AKS (Azure Kubernetes Service) is the customer's address: the live cluster where your application runs for real users.
- Helm is the delivery instructions taped to the box: it tells AKS exactly how to unpack, configure, and run your app.
Put it all together: code is merged → the truck picks up the box from the warehouse → drives it to the address using the delivery instructions → your app is live. No human intervention needed.
Pipeline Overview
The end-to-end deployment pipeline follows a clear, linear flow:
Code Push → Lint & Test → Build Docker Image → Push to ACR → Helm Deploy → AKS Cluster
                                                                ↓
                                Staging → Smoke Tests → Production (with approval)
Prerequisites
- AKS cluster: a running Kubernetes cluster in Azure
- ACR registry: an Azure Container Registry to store Docker images
- Authentication: a Service Principal or OIDC Federated Credential with permissions to push to ACR and deploy to AKS
- Helm chart: a chart in your repository (e.g., ./charts/myapp/) that describes your Kubernetes resources
Already know AKS? See our AKS Course for cluster management deep-dives. Need Helm basics? See our Helm Course for charts, templating, and values.
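For orientation, a minimal values.yaml consistent with the --set flags used later in this guide might look like this (a sketch; your chart's keys and defaults may differ):

```yaml
# ./charts/myapp/values.yaml: defaults overridden by --set in the pipeline
replicaCount: 2
image:
  repository: myregistry.azurecr.io/myapp   # overridden with your ACR name
  tag: latest                               # overridden with the commit SHA
ingress:
  host: myapp.example.com                   # overridden per environment
```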
Azure Authentication
Before your workflow can push images or deploy to AKS, it must authenticate with Azure. There are two primary methods:
Method 1: Service Principal (Client Secret)
Create a Service Principal and store its credentials as GitHub secrets:
- AZURE_CLIENT_ID: the app (client) ID
- AZURE_CLIENT_SECRET: the client secret
- AZURE_TENANT_ID: your Azure AD tenant ID
- AZURE_SUBSCRIPTION_ID: the target subscription ID
```yaml
- uses: azure/login@v2
  with:
    creds: |
      {
        "clientId": "${{ secrets.AZURE_CLIENT_ID }}",
        "clientSecret": "${{ secrets.AZURE_CLIENT_SECRET }}",
        "tenantId": "${{ secrets.AZURE_TENANT_ID }}",
        "subscriptionId": "${{ secrets.AZURE_SUBSCRIPTION_ID }}"
      }
```
Method 2: OIDC Federated Credentials (Recommended)
OIDC eliminates stored secrets entirely. GitHub's token is exchanged for an Azure access token using a trust relationship configured in Azure AD.
- No client secret to rotate or leak
- Short-lived tokens scoped to the workflow run
- Requires configuring a Federated Credential on the App Registration
```yaml
permissions:
  id-token: write   # Required for OIDC
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
```
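Configuring the trust relationship is a one-time setup task. A sketch using the Azure CLI (the org, repo, and credential name are placeholders you must replace):

```shell
# One-time: register a federated credential on the App Registration so
# GitHub's OIDC token for pushes to main is trusted (names are examples).
az ad app federated-credential create \
  --id <APP_OBJECT_ID> \
  --parameters '{
    "name": "github-main",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:myorg/myrepo:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'
```

For environment-scoped deployments, the subject takes the form repo:myorg/myrepo:environment:production instead.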
Getting AKS Credentials
After Azure login, fetch the kubeconfig so kubectl and helm can talk to your cluster:
```yaml
- run: az aks get-credentials --resource-group ${{ vars.RESOURCE_GROUP }} --cluster-name ${{ vars.CLUSTER_NAME }} --overwrite-existing
```
Service Principal secrets can leak, be forgotten, or expire. OIDC federated credentials are short-lived, automatically scoped to the exact workflow, and never stored in GitHub. Use OIDC for all new setups unless your Azure AD tenant doesn't support it.
Build & Push to ACR
After authenticating, build the Docker image and push it to Azure Container Registry. Tag with both the commit SHA (immutable, traceable) and latest (convenience).
```yaml
- uses: azure/docker-login@v1
  with:
    login-server: ${{ vars.ACR_NAME }}.azurecr.io
    username: ${{ secrets.ACR_USERNAME }}
    password: ${{ secrets.ACR_PASSWORD }}

# Buildx is required for the GitHub Actions (gha) cache backend below
- uses: docker/setup-buildx-action@v3

- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: |
      ${{ vars.ACR_NAME }}.azurecr.io/myapp:${{ github.sha }}
      ${{ vars.ACR_NAME }}.azurecr.io/myapp:latest
    cache-from: type=gha
    cache-to: type=gha,mode=max
```
Using ${{ github.sha }} as the primary tag creates an immutable, auditable link between your Git commit and the deployed image. You can always trace exactly which code is running in production. The latest tag is a convenience alias; never rely on it for production deployments.
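If you'd rather not store ACR admin credentials at all, you can reuse the azure/login session instead of azure/docker-login. A sketch, assuming the authenticated identity holds the AcrPush role on the registry:

```yaml
- uses: azure/login@v2
  with:
    client-id: ${{ secrets.AZURE_CLIENT_ID }}
    tenant-id: ${{ secrets.AZURE_TENANT_ID }}
    subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

# az acr login wires Docker credentials from the Azure session,
# so no ACR_USERNAME/ACR_PASSWORD secrets are needed.
- run: az acr login --name ${{ vars.ACR_NAME }}
```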
Helm Deploy to AKS
With the image in ACR and kubeconfig configured, use Helm to deploy (or upgrade) the application on AKS:
```yaml
- uses: azure/aks-set-context@v3
  with:
    resource-group: ${{ vars.RESOURCE_GROUP }}
    cluster-name: ${{ vars.CLUSTER_NAME }}

- run: |
    helm upgrade --install myapp ./charts/myapp \
      --namespace production \
      --create-namespace \
      --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
      --set image.tag=${{ github.sha }} \
      --set ingress.host=myapp.example.com \
      --wait --timeout 5m
```
| Flag | Purpose |
|---|---|
| upgrade --install | Idempotent: installs if the release is new, upgrades if it already exists |
| --namespace production | Deploy into a dedicated namespace for isolation |
| --create-namespace | Auto-create the namespace if it doesn't exist yet |
| --set image.tag=$SHA | Pin to the exact commit's Docker image |
| --wait | Block until all pods are Ready, catching crash loops early |
| --timeout 5m | Fail the step if pods aren't ready within 5 minutes |
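If you prefer an explicit failure handler over relying solely on flags, a rollback step can be attached to the deploy job. A sketch (release and namespace as above; helm rollback with no revision argument reverts to the previous release):

```yaml
- name: Roll back on failed deploy
  if: failure()   # runs only when an earlier step in this job failed
  run: |
    # No revision argument: helm rollback targets the previous release
    helm rollback myapp --namespace production --wait --timeout 5m
```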
Complete Production Workflow
Here's a full workflow combining lint, test, build, and multi-stage deployment with environments and manual approval:
```yaml
# .github/workflows/deploy-aks.yml
name: Build & Deploy to AKS

on:
  push:
    branches: [main]

permissions:
  id-token: write
  contents: read

jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm test

  build-and-push:
    needs: lint-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/docker-login@v1
        with:
          login-server: ${{ vars.ACR_NAME }}.azurecr.io
          username: ${{ secrets.ACR_USERNAME }}
          password: ${{ secrets.ACR_PASSWORD }}
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ vars.ACR_NAME }}.azurecr.io/myapp:${{ github.sha }}
            ${{ vars.ACR_NAME }}.azurecr.io/myapp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

  deploy-staging:
    needs: build-and-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}
      - uses: azure/setup-helm@v3
        with:
          version: v3.14.0
      - run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace staging \
            --create-namespace \
            --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
            --set image.tag=${{ github.sha }} \
            --set ingress.host=staging.myapp.example.com \
            --atomic --wait --timeout 5m

  smoke-test:
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - run: |
          echo "Running smoke tests against staging..."
          for i in 1 2 3 4 5; do
            STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://staging.myapp.example.com/health)
            if [ "$STATUS" = "200" ]; then
              echo "Health check passed (attempt $i)"
              exit 0
            fi
            echo "Attempt $i: got $STATUS, retrying in 10s..."
            sleep 10
          done
          echo "Smoke test failed after 5 attempts"
          exit 1

  deploy-production:
    needs: smoke-test
    runs-on: ubuntu-latest
    environment: production   # Requires manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - uses: azure/aks-set-context@v3
        with:
          resource-group: ${{ vars.RESOURCE_GROUP }}
          cluster-name: ${{ vars.CLUSTER_NAME }}
      - uses: azure/setup-helm@v3
        with:
          version: v3.14.0
      - run: |
          helm upgrade --install myapp ./charts/myapp \
            --namespace production \
            --create-namespace \
            --set image.repository=${{ vars.ACR_NAME }}.azurecr.io/myapp \
            --set image.tag=${{ github.sha }} \
            --set ingress.host=myapp.example.com \
            --atomic --wait --timeout 5m
```
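The inline smoke-test loop in the workflow can be factored into a small reusable helper. A standalone sketch (the probe command, attempt count, and delay are up to you):

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds, sleeping between failures.
# Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "Check passed (attempt $i)"
      return 0
    fi
    echo "Attempt $i failed, retrying in ${delay}s..."
    sleep "$delay"
    i=$((i + 1))
  done
  echo "All $attempts attempts failed"
  return 1
}
```

In the smoke-test job, the step body then reduces to something like retry 5 10 curl -sf https://staging.myapp.example.com/health.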
Deployment Strategies
Production deployments need more than just helm upgrade. Choose the right strategy based on your risk tolerance and traffic patterns.
Strategy 1: Rolling Update (Default)
Helm's default: new pods start while old pods terminate. Zero downtime when configured correctly.
```yaml
# In your Helm chart's deployment.yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never kill old pods before new ones are ready
      maxSurge: 1         # Add one extra pod during rollout
  template:
    spec:
      containers:
        - name: myapp
          readinessProbe:   # CRITICAL: traffic only routes when ready
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
          lifecycle:
            preStop:        # Allow in-flight requests to finish
              exec:
                command: ["sleep", "15"]
```
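Rolling updates pair well with a PodDisruptionBudget, which limits how many replicas voluntary disruptions (node drains, cluster upgrades) may take down at once. A sketch, assuming your pods carry the label app: myapp:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
  namespace: production
spec:
  minAvailable: 2        # keep at least 2 pods running during node drains
  selector:
    matchLabels:
      app: myapp         # must match your Deployment's pod labels
```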
Strategy 2: Blue-Green Deployment
Two identical environments: deploy to the inactive one, switch traffic instantly, and keep the old one for rollback.
```yaml
# GitHub Actions workflow: blue-green via Helm
deploy-green:
  runs-on: ubuntu-latest
  environment: production
  steps:
    - uses: actions/checkout@v4
    - uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - uses: azure/aks-set-context@v3
      with:
        resource-group: ${{ vars.RESOURCE_GROUP }}
        cluster-name: ${{ vars.CLUSTER_NAME }}

    # Step 1: Deploy to GREEN slot (inactive)
    - name: Deploy to green slot
      run: |
        helm upgrade --install myapp-green ./charts/myapp \
          --namespace production \
          --set image.tag=${{ github.sha }} \
          --set slot=green \
          --set service.enabled=false \
          --atomic --wait --timeout 5m

    # Step 2: Smoke test green slot via port-forward or internal service
    - name: Smoke test green
      run: |
        kubectl port-forward svc/myapp-green 8080:80 -n production &
        sleep 5
        STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
        kill %1
        if [ "$STATUS" != "200" ]; then
          echo "::error::Green slot health check failed (HTTP $STATUS)"
          helm uninstall myapp-green -n production
          exit 1
        fi
        echo "Green slot is healthy"

    # Step 3: Switch traffic from blue to green
    - name: Switch traffic to green
      run: |
        kubectl patch service myapp -n production \
          -p '{"spec":{"selector":{"slot":"green"}}}'
        echo "Traffic switched to green"

    # Step 4: Keep blue as rollback target (scale down after 30 min)
    - name: Schedule blue cleanup
      run: |
        echo "Blue slot kept for rollback. Remove manually after validation:"
        echo "  helm uninstall myapp-blue -n production"
```
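If a problem surfaces after cutover, rollback is the same selector patch in reverse. A sketch of a manual switch-back step (slot labels as in the workflow above):

```yaml
- name: Roll back to blue
  run: |
    # Point the live service back at the blue pods; traffic shifts instantly
    kubectl patch service myapp -n production \
      -p '{"spec":{"selector":{"slot":"blue"}}}'
```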
Strategy 3: Canary Deployment
Route a small percentage of traffic to the new version, monitor metrics, then promote or rollback.
```yaml
canary-deploy:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: azure/login@v2
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    - uses: azure/aks-set-context@v3
      with:
        resource-group: ${{ vars.RESOURCE_GROUP }}
        cluster-name: ${{ vars.CLUSTER_NAME }}

    # Deploy canary with 1 replica (stable has 5)
    - name: Deploy canary
      run: |
        helm upgrade --install myapp-canary ./charts/myapp \
          --namespace production \
          --set image.tag=${{ github.sha }} \
          --set replicaCount=1 \
          --set canary=true \
          --atomic --wait --timeout 5m

    # Monitor error rate for 5 minutes
    - name: Monitor canary health
      run: |
        echo "Monitoring canary for 5 minutes..."
        for i in $(seq 1 30); do
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            https://myapp.example.com/health)
          ERRORS=$(kubectl logs -l app=myapp,canary=true \
            -n production --tail=50 | grep -c "ERROR" || true)
          echo "Check $i/30: HTTP $STATUS, errors $ERRORS"
          if [ "$ERRORS" -gt 5 ]; then
            echo "::error::Canary error rate too high, rolling back"
            helm uninstall myapp-canary -n production
            exit 1
          fi
          sleep 10
        done
        echo "Canary healthy after 5 minutes"

    # Promote: update stable release to new version
    - name: Promote to stable
      run: |
        helm upgrade --install myapp ./charts/myapp \
          --namespace production \
          --set image.tag=${{ github.sha }} \
          --set replicaCount=5 \
          --atomic --wait --timeout 5m
        # Remove canary
        helm uninstall myapp-canary -n production
```
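Replica counts only give coarse traffic splitting. With the NGINX ingress controller you can declare an exact canary weight instead. A sketch, assuming a myapp-canary service on port 80 (names follow the example above):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  namespace: production
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"   # route ~5% of traffic here
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 80
```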
Rollback Strategies
| Method | Command / Config | When to Use |
|---|---|---|
| Automatic (--atomic) | helm upgrade --atomic | Failed deploy auto-reverts to previous release |
| Manual Helm rollback | helm rollback myapp <revision> | Post-deploy issue discovered after workflow completes |
| Re-deploy previous SHA | Re-run workflow with old commit's tag | When you know which exact version was good |
| Blue-green switch-back | kubectl patch svc … selector: slot: blue | Instant traffic switch to the previous environment |
| GitOps revert | git revert <bad-commit>, then the pipeline redeploys | GitOps workflows where Git is the source of truth |
A rollback strategy that has never been tested is not a rollback strategy. Periodically run helm rollback in staging to verify it works: check that database migrations are backward-compatible and that no breaking config changes prevent the old version from starting.
Strategy Comparison
| Strategy | Downtime Risk | Rollback Speed | Complexity | Best For |
|---|---|---|---|---|
| Rolling Update | None (if probes set) | ~2–5 min | Low | Most applications |
| Blue-Green | None | Instant | Medium | Mission-critical services |
| Canary | Partial (% of traffic) | Fast (uninstall canary) | High | High-traffic services, risk mitigation |
Deployment Flow Diagram
```
Git Push → Lint & Test → Docker Build → ACR Push
                                           │
                                           ▼
Helm Deploy (Staging) → Smoke Tests (Staging) → Manual Approval
                                           │
                                           ▼
     Rolling Update  /  Blue-Green Switch  /  Canary → Promote
```
Hands-on Lab
Lab 1: Configure Azure Credentials
- Create a Service Principal:
  az ad sp create-for-rbac --name "github-actions-sp" --role contributor --scopes /subscriptions/<SUB_ID>
- Store the output values as GitHub secrets: AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, AZURE_TENANT_ID, AZURE_SUBSCRIPTION_ID
- Add repository variables: ACR_NAME, RESOURCE_GROUP, CLUSTER_NAME
- Test authentication by adding azure/login@v2 + az account show to a test workflow
Lab 2: Build & Push to ACR
- Create a simple Dockerfile in your repository (Node.js, Python, or any runtime)
- Add a workflow that builds the image and pushes to ACR with the ${{ github.sha }} tag
- Trigger the workflow and verify the image appears in ACR:
  az acr repository show-tags --name <ACR_NAME> --repository myapp
- Pull the image locally and run it to confirm it works:
  docker run <ACR_NAME>.azurecr.io/myapp:<SHA>
Lab 3: Deploy to AKS with Helm
- Create a basic Helm chart: helm create charts/myapp
- Update values.yaml to use your ACR image and ingress settings
- Add the Helm deploy step to your workflow, targeting a staging namespace
- Push to main and watch the deployment in the Actions tab
- Verify pods are running: kubectl get pods -n staging
Lab 4: Smoke Tests Post-Deploy
- Add a smoke test job that runs after staging deployment
- Curl the health endpoint and assert a 200 response
- Configure the production environment with required reviewers in GitHub Settings → Environments
- Push a change and verify the workflow pauses at the production gate until approved
Debugging Common Issues
"ACR login failed"
- Wrong credentials: Double-check the ACR_USERNAME and ACR_PASSWORD secrets. For admin-enabled ACR, the username is the registry name and the password comes from the ACR Access Keys blade
- Service Principal expired: SP secrets have a default expiry (1–2 years). Regenerate with az ad sp credential reset and update the GitHub secret
- Firewall rules: If ACR has network restrictions, GitHub-hosted runners may be blocked. Use a self-hosted runner inside the VNet or allow GitHub's IP ranges
"Helm deploy timeout"
- Pod crash loop: The container is starting and failing repeatedly. Check kubectl logs <pod> -n <namespace> for startup errors (missing env vars, bad config)
- Image pull error: AKS can't pull from ACR. Verify the ACR-AKS integration:
  az aks check-acr --name <CLUSTER> --resource-group <RG> --acr <ACR_NAME>.azurecr.io
- Resource limits: The pod requests more CPU/memory than the node can provide. Check kubectl describe pod <pod> for scheduling failures
- Readiness probe failing: The --wait flag waits for readiness probes. If your probe endpoint isn't implemented or returns errors, the deploy will time out
"az aks get-credentials failed"
- Wrong resource group or cluster name: Verify with az aks list -o table
- Insufficient permissions: The SP needs at minimum the Azure Kubernetes Service Cluster User Role on the AKS resource
- Cluster not running: If the cluster is stopped, get-credentials succeeds but kubectl commands fail. Start it with az aks start
"ImagePullBackOff": AKS Can't Pull from ACR
- ACR not attached to AKS: Run az aks check-acr --name <CLUSTER> --resource-group <RG> --acr <ACR_NAME>.azurecr.io. If not attached: az aks update -n <CLUSTER> -g <RG> --attach-acr <ACR_NAME>
- Image tag doesn't exist: Verify: az acr repository show-tags --name <ACR_NAME> --repository myapp -o table
- Private ACR with network rules: AKS nodes must have network access to ACR. Check firewall rules and private endpoint config
"OIDC token request failed"
- Missing id-token: write permission: OIDC requires it; add it to the workflow-level or job-level permissions: block
- Federated credential subject mismatch: The subject claim must match exactly. For environment-scoped deployments use repo:ORG/REPO:environment:production, not ref:refs/heads/main
- Audience mismatch: Azure expects api://AzureADTokenExchange as the audience. Verify your federated credential config
Deployment Debugging Decision Tree
```
Helm deploy failed?
├── "ACR login failed"
│   ├── Credentials wrong → Check ACR_USERNAME/PASSWORD or OIDC config
│   ├── SP expired → az ad sp credential reset
│   └── Firewall blocking → Whitelist GitHub runner IPs or use self-hosted
│
├── "UPGRADE FAILED" / timeout
│   ├── kubectl get pods → ImagePullBackOff?
│   │   ├── Image tag missing → Check ACR tags
│   │   └── ACR not attached → az aks update --attach-acr
│   ├── kubectl get pods → CrashLoopBackOff?
│   │   ├── kubectl logs <pod> → App startup error
│   │   └── kubectl describe pod → Missing env vars, bad config
│   ├── kubectl get pods → Pending?
│   │   └── kubectl describe pod → Resource quota exceeded / no nodes
│   └── kubectl get pods → Running but not Ready?
│       └── Readiness probe failing → Check /health endpoint
│
├── "OIDC token request failed"
│   ├── Missing permissions: id-token: write
│   ├── Subject claim mismatch → Check federated credential
│   └── Audience mismatch → Verify api://AzureADTokenExchange
│
└── Workflow succeeded but app shows old version
    ├── Wrong namespace → Check --namespace flag
    ├── Image tag not updated → Check --set image.tag value
    └── Cached image → Check imagePullPolicy: Always
```
Interview Questions
Basic (5)
1. What is Azure Container Registry (ACR) and why is it used in CI/CD?
ACR is a managed Docker container registry hosted in Azure. In CI/CD, it serves as the central storage for Docker images built during the pipeline. It integrates natively with AKS and supports geo-replication, vulnerability scanning, and RBAC, making it the natural choice for Azure-based Kubernetes deployments.
2. What does helm upgrade --install do?
It's an idempotent deploy command. If the release doesn't exist yet, it performs helm install. If it already exists, it performs helm upgrade. This means the same command works for both first-time deployments and updates, which is ideal for CI/CD where you don't want to track release state.
3. Why tag Docker images with the Git commit SHA?
The commit SHA creates an immutable, one-to-one link between the source code and the deployed image. You can always trace exactly which code is running in any environment. Unlike latest or version tags, the SHA never changes; two different commits can never produce the same tag.
4. What are GitHub Environments used for in deployment workflows?
Environments define deployment targets (staging, production) with protection rules. You can require manual approval, restrict which branches can deploy, add wait timers, and scope secrets/variables per environment. This creates a controlled promotion path from staging to production.
5. What is the --wait flag in Helm and why is it important in CI/CD?
The --wait flag tells Helm to block until all deployed resources (pods, services, etc.) reach a Ready state. In CI/CD, this is critical because without it, the workflow would report success immediately even if pods are crash-looping. Combined with --timeout, it ensures the pipeline fails fast if the deployment is broken.
Intermediate (5)
6. Compare Service Principal vs OIDC authentication for GitHub Actions to Azure.
Service Principal auth uses a client secret stored as a GitHub secret; it works universally, but the secret can leak, must be rotated, and is long-lived. OIDC (OpenID Connect) uses GitHub's built-in token, exchanged for an Azure access token via a trust relationship. No secret is stored, and tokens are short-lived and scoped to the workflow run. OIDC is more secure and recommended for all new setups.
7. How do you perform a Helm rollback in a CI/CD pipeline when a deployment fails?
Use the --atomic flag in your helm upgrade command. If the upgrade fails (pods don't become ready within the timeout), Helm automatically rolls back to the previous release. Alternatively, add a failure handler step that runs helm rollback <release> <revision> using if: failure(). You can get the previous revision number from helm history.
8. Explain the role of the azure/aks-set-context action.
This action configures the kubeconfig for the workflow so that subsequent kubectl and helm commands target the correct AKS cluster. It fetches cluster credentials using the authenticated Azure session and sets the KUBECONFIG environment variable. Without it, Helm would have no cluster to deploy to.
9. How would you deploy the same Helm chart to multiple environments with different configurations?
Use environment-specific values files (values-staging.yaml, values-production.yaml) and pass them via --values. Combine with GitHub Environments to scope variables per environment: vars.INGRESS_HOST resolves to staging.myapp.com in staging and to myapp.com in production. The chart stays identical; only the values change.
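A sketch of the values-file layout this answer describes (file names are conventions, not requirements; keys mirror the chart used earlier):

```yaml
# values-staging.yaml: overrides applied on top of the shared values.yaml
replicaCount: 1
ingress:
  host: staging.myapp.example.com

# Deployed with:
#   helm upgrade --install myapp ./charts/myapp \
#     --namespace staging --values values-staging.yaml
```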
10. What permissions does a Service Principal need for a GitHub Actions β ACR β AKS pipeline?
Minimum: AcrPush role on the ACR (to push images), Azure Kubernetes Service Cluster User Role on the AKS cluster (to get credentials and deploy), and Reader on the resource group. For OIDC, you also need a Federated Credential configured on the App Registration pointing to your GitHub repo and branch.
Senior (5)
11. Design a blue-green deployment strategy for AKS using GitHub Actions and Helm.
Maintain two namespaces (or label sets): blue (current live) and green (new version). The pipeline deploys to the inactive set using Helm, runs comprehensive smoke tests, then switches the ingress/service selector from blue to green. If smoke tests fail, no traffic shift occurs. After successful cutover, the old set is kept as an instant rollback target. Implement via Helm values: --set slot=green controls labels; a separate step updates the ingress annotation or service mesh routing rule.
12. How would you implement canary deployments to AKS from GitHub Actions?
Use a canary Helm release alongside the stable release. Deploy with --set replicaCount=1 for the canary version while keeping the stable release at full scale. Configure an ingress controller (like Nginx with canary annotations) or a service mesh (like Istio) to route a percentage of traffic (e.g., 5%) to the canary pods. Monitor error rates and latency in the smoke test job. If metrics look good, gradually increase the canary weight in subsequent workflow steps. If metrics degrade, run helm uninstall on the canary release.
13. A microservices team has 12 services deployed to AKS via GitHub Actions. How do you structure the pipelines?
Use a monorepo with path filters: each service triggers only when its directory changes (on.push.paths: ['services/auth/**']). Share a reusable workflow (.github/workflows/deploy-service.yml) that accepts inputs: service name, chart path, namespace. Each service's workflow calls the reusable one with its specific parameters. Use a matrix strategy for shared infrastructure components. Implement dependency ordering via needs: for services with startup dependencies. Centralize Helm chart templates in a shared library chart.
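The caller side of that reusable-workflow pattern might look like this (service name, inputs, and paths are illustrative; the shared workflow must declare matching workflow_call inputs):

```yaml
# .github/workflows/deploy-auth.yml (one caller per service)
name: Deploy auth service
on:
  push:
    branches: [main]
    paths: ['services/auth/**']    # build only when this service changes

jobs:
  deploy:
    uses: ./.github/workflows/deploy-service.yml   # shared reusable workflow
    with:
      service: auth
      chart-path: ./charts/auth
      namespace: production
    secrets: inherit
```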
14. Your OIDC authentication works in staging but fails in production. What do you investigate?
Check: (1) The Federated Credential's subject filter: it may be scoped to ref:refs/heads/main while production deploys from a different branch, tag, or environment. (2) Environment-scoped Federated Credentials may restrict which environments can authenticate. (3) The permissions: id-token: write must be declared at the job or workflow level. (4) The Azure App Registration may have Conditional Access policies that differ by environment. (5) Inspect GitHub's OIDC token claims using curl $ACTIONS_ID_TOKEN_REQUEST_URL to see exactly what's being sent.
15. How do you ensure zero-downtime deployments to AKS via Helm in CI/CD?
Multiple layers: (1) RollingUpdate strategy in the Deployment with maxUnavailable: 0, so new pods start before old ones terminate. (2) Proper readiness probes so traffic only routes to healthy pods. (3) preStop lifecycle hooks with a sleep to allow in-flight requests to complete before pod termination. (4) Pod Disruption Budgets to prevent too many pods going down simultaneously. (5) The --atomic flag in Helm so a failed upgrade auto-rolls back. (6) Connection draining configured on the ingress controller. All of these are configured in the Helm chart's templates and values.
Real-World Scenario
A fintech startup runs 12 microservices on AKS, all deployed via GitHub Actions and Helm. Here's how their pipeline evolved:
Phase 1: Manual deploys. Engineers ran helm upgrade from their laptops. Different developers had different kubeconfigs, sometimes deploying debug builds to production. Deployments were infrequent (weekly) because they were risky and time-consuming.
Phase 2: Basic CI/CD. A single workflow built and deployed all 12 services on every push to main. Build times ballooned to 45 minutes. A bug in one service blocked deployment of all others.
Phase 3: Optimized pipeline (current):
- Path-filtered triggers: each service only builds when its directory changes, reducing average pipeline time to 6 minutes
- Reusable workflow: a single deploy-service.yml accepts service name, namespace, and chart path as inputs. All 12 services call the same workflow
- Staging auto-deploy: every merge to main automatically deploys to the staging namespace. Smoke tests run against staging endpoints
- Production manual approval: the production environment requires approval from two senior engineers. Deployment windows are enforced via environment protection rules
- OIDC authentication: eliminated all stored Azure secrets. Federated credentials are scoped per environment
- Helm --atomic: three production incidents were automatically rolled back before users noticed, thanks to the atomic flag and proper readiness probes
Result: deployments went from weekly and risky to 20+ per day with zero downtime. Mean time to production dropped from 5 days to 15 minutes.
Summary
- Pipeline flow: Code Push → Lint & Test → Docker Build → ACR Push → Helm Deploy → AKS (staging → production)
- Azure auth: Use azure/login@v2 with OIDC (recommended) or Service Principal credentials stored as secrets
- ACR push: Tag images with ${{ github.sha }} for immutable traceability; use Docker layer caching (type=gha) for speed
- Helm deploy: helm upgrade --install with --atomic --wait --timeout 5m for safe, idempotent deployments
- Environments: Use GitHub Environments to separate staging (auto-deploy) and production (manual approval)
- Smoke tests: Always validate staging before promoting to production; curl health endpoints, check response codes
- OIDC over SP: Federated credentials eliminate secret rotation and reduce leak risk
- Debugging: Most failures trace to auth issues (expired SP, wrong credentials), image pull errors (ACR-AKS integration), or pod crashes (check logs and describe)