Interview Preparation
50+ curated interview questions covering CI/CD fundamentals, GitHub Actions architecture, YAML pipelines, deployment strategies, security, and real-world troubleshooting, each with a detailed answer.
- Questions are grouped by category and difficulty.
- Each question has a collapsible detailed answer; try answering out loud before expanding.
- ★ marks questions that come up in 80%+ of interviews.
- "Scenario" questions simulate real whiteboard / design sessions.
Category 1: CI/CD Fundamentals
Q1: What is CI/CD and why is it important?
CI (Continuous Integration): every developer commit triggers an automated build and test cycle. The goal is to catch bugs early and provide fast feedback. If a test fails, the team knows within minutes, not days.
CD (Continuous Delivery / Continuous Deployment): the pipeline automatically packages and delivers the application to staging or production, reducing manual errors and shipping faster.
Key distinction:
- Continuous Integration: merge → build → test (automated)
- Continuous Delivery: …→ deploy to staging (automated), deploy to production (manual gate)
- Continuous Deployment: …→ deploy to staging AND production (fully automated, no human approval)
Why it matters: without CI/CD, teams accumulate "integration debt"; the longer you wait to merge, the more painful it becomes. CI/CD converts that big-bang integration into many small, safe, reversible changes.
Q2: Explain the difference between Continuous Delivery and Continuous Deployment.
Continuous Delivery: the pipeline automates everything up to staging. Production deployments require a manual approval gate: a human clicks "approve" before the code goes live. This is common in regulated industries (finance, healthcare).
Continuous Deployment: fully automated. Every commit that passes all tests goes straight to production with no human approval. This requires high confidence in your test suite and monitoring systems.
Interview tip: Most companies practice Continuous Delivery, not Deployment. Saying "we use CD with manual production gates and environment protection rules" shows pragmatism.
Q3: What are the benefits of CI/CD for a development team?
- Faster feedback loops: developers know within minutes if their change broke something.
- Reduced integration risk: small, frequent merges are far less risky than large, infrequent ones.
- Repeatable deployments: the same pipeline runs every time, eliminating "it works on my machine" problems.
- Audit trail: every deployment is linked to a commit, a build, and a set of test results.
- Developer confidence: teams ship more often when they trust the pipeline to catch mistakes.
- Cost reduction: catching bugs in CI is 10–100× cheaper than finding them in production.
Q4: Describe a typical CI/CD pipeline for a microservice.
A standard pipeline for a containerized microservice:
Lint → Unit Test → Build Docker Image → Push to Registry
→ Deploy to Staging → Integration Tests → Manual Approval
→ Deploy to Production → Smoke Test
In GitHub Actions terms:
- Lint job: ESLint, Pylint, or golangci-lint to catch style and error issues.
- Test job: run unit tests with coverage thresholds.
- Build & Push job: `docker/build-push-action` with the image tagged as the commit SHA.
- Deploy staging job: `helm upgrade --install` in a staging namespace.
- Integration test job: hit staging endpoints and verify responses.
- Deploy production job: requires `environment: production` with reviewer approval.
- Smoke test job: verify the health endpoint returns 200 in production.
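A minimal workflow skeleton wiring these jobs together with `needs:` might look like the following sketch (job names, registry, and chart path are illustrative placeholders, not from a real project):

```yaml
name: Microservice CI/CD
on:
  push:
    branches: [main]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm test
  build-push:
    needs: [lint, test]            # Waits for both parallel jobs
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: registry.example.com/myapp:${{ github.sha }}
  deploy-staging:
    needs: build-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - run: helm upgrade --install myapp ./chart --namespace staging
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production        # Reviewer approval gates this job
    steps:
      - uses: actions/checkout@v4
      - run: helm upgrade --install myapp ./chart --namespace production
```

The `needs:` chains encode the arrow diagram above; everything not chained (lint, test) runs in parallel.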
Q5: What is a blue-green deployment?
Two identical environments run in parallel:
- Blue = current production (serving live traffic).
- Green = new version (deployed, tested, but not yet receiving traffic).
Once green passes health checks, the load balancer switches 100% of traffic from blue to green instantly. If something goes wrong, switch back to blue: an instant rollback with zero downtime.
Trade-off: Requires double the infrastructure during the switch, so it's more expensive but safer.
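In Kubernetes, the "switch" is often just a Service selector change. A minimal sketch, assuming two Deployments labeled `version: blue` and `version: green` (names and labels are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue   # Flip to "green" to cut traffic over; flip back to roll back
  ports:
    - port: 80
      targetPort: 8080
```

Both Deployments stay running during the switch, which is exactly where the double-infrastructure cost comes from.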
Q6: What is a canary deployment?
Route a small percentage (5–10%) of live traffic to the new version while the remaining 90–95% continues hitting the old version. Monitor error rates, latency, and resource usage for a defined period.
- If metrics look good: gradually increase traffic to the new version (25% → 50% → 100%).
- If metrics degrade: immediately route all traffic back to the old version (rollback).
Canary deployments are less risky than blue-green because only a fraction of users are affected if something goes wrong. Often implemented via Ingress weights, service mesh (Istio), or Flagger.
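With the NGINX ingress controller, the weighting can be expressed as annotations on a second Ingress. A hedged sketch (host and service names are made up):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"  # ~10% of traffic to the canary
spec:
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 80
```

Raising the weight step by step (10 → 25 → 50 → 100) implements the gradual promotion described above.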
Q7: How do you handle database migrations in CI/CD?
Rule #1: Never run destructive migrations automatically.
Best practices:
- Run migrations as a separate step before application deployment.
- Use migration tools like Flyway, Liquibase, or language-specific tools (Alembic for Python, TypeORM for Node.js).
- Make migrations backward-compatible: the old app version should still work after the migration runs (expand-and-contract pattern).
- Separate schema migration from data migration: schema changes are fast, data backfills are slow.
- Add a manual gate before destructive operations (dropping columns, truncating tables).
```yaml
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: flyway -url=${{ secrets.DB_URL }} migrate
  deploy:
    needs: migrate   # Deploy only after migration succeeds
    runs-on: ubuntu-latest
    steps:
      - run: helm upgrade myapp ./chart
```
Q8: What is GitOps and how does it relate to CI/CD?
GitOps = Git as the single source of truth for both infrastructure and application state.
- CI pushes changes to a Git repository (e.g., updates an image tag in a Helm values file).
- CD is handled by a GitOps operator (ArgoCD, Flux) running inside the cluster. The operator watches Git and reconciles cluster state to match the desired state in Git.
Key difference from traditional CI/CD: the pipeline doesn't directly `kubectl apply` or `helm upgrade`. Instead, it commits to Git, and the in-cluster operator pulls and applies. This gives you an audit trail (every change is a Git commit), drift detection (the operator alerts if someone changes the cluster manually), and declarative state management.
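As an illustration, a minimal Argo CD `Application` that reconciles a Helm chart from Git might look like this sketch (the repo URL, path, and namespace are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config.git
    targetRevision: main
    path: charts/myapp
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # Delete resources removed from Git
      selfHeal: true   # Revert manual drift in the cluster
```

CI's only job in this model is to update the image tag in `deploy-config`; the operator does the rest.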
Q9: What's the difference between a build artifact and a container image?
Build artifact: the compiled output of your build process (a JAR file, a Go binary, a webpack bundle, or an npm package). It's portable but needs a compatible runtime environment to execute.
Container image: an artifact plus its runtime and OS-level dependencies, packaged as an OCI image (layers). It's a self-contained deployable unit: it runs identically on any machine with a container runtime.
Analogy: an artifact is the recipe. A container image is the meal-prep kit (recipe, ingredients, and utensils, all in one box). You just heat and serve.
Q10: How do you implement rollback in a CI/CD pipeline?
Multiple strategies, often combined:
- Helm rollback: `helm rollback myapp <previous-revision>` restores the previous Helm release.
- Re-deploy previous image: trigger the pipeline with the previous commit SHA. Since images are tagged with the SHA, this deploys the known-good version.
- Git revert: `git revert <bad-commit>` and push; the pipeline deploys the reverted code automatically.
- Blue-green switch: route traffic back to the previous environment instantly.
Critical practice: always tag images with the commit SHA, not `latest`. This gives you full traceability: you can always map a running container back to the exact code that produced it.
Category 2: GitHub Actions Architecture
Q11: Explain the relationship between events, workflows, jobs, and steps in GitHub Actions.
The full hierarchy:
```text
Event (push, PR, schedule, workflow_dispatch)
└── Workflow (.github/workflows/ci.yml)
    └── Job (runs on a specific runner)
        └── Step (a single unit: action or shell command)
```
- Event: the trigger (a push, a pull request, a cron schedule, or a manual dispatch).
- Workflow: a YAML file in `.github/workflows/` that responds to one or more events.
- Job: a set of steps that run on the same runner. Jobs run in parallel by default, but you can create dependencies with `needs:`.
- Step: a single unit of work, either a shell command (`run:`) or an action (`uses:`). Steps run sequentially within a job and share the same filesystem.
Key detail: steps share the runner workspace (filesystem + environment variables). Jobs do not: each job gets a fresh runner. To share data between jobs, use outputs or artifacts.
Q12: What is the difference between `uses` and `run` in a step?
- `uses:` invokes a pre-built action: a JavaScript action, Docker action, or composite action from the Marketplace or a local path. Example: `uses: actions/checkout@v4`
- `run:` executes inline shell commands directly on the runner. Example: `run: npm test`
Important: you cannot combine `uses:` and `run:` in the same step. Each step is either an action invocation or a shell command, never both.
```yaml
# Correct: separate steps
- uses: actions/checkout@v4
- run: npm install && npm test

# Invalid: can't mix uses and run in one step
- uses: actions/checkout@v4
  run: npm test
```
Q13: How do jobs communicate with each other?
Three mechanisms:
- `needs:` defines execution order: Job B waits for Job A to complete before starting.
- Outputs pass small data (strings) between jobs:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.tag.outputs.tag }}
    steps:
      - id: tag
        run: echo "tag=sha-$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying ${{ needs.build.outputs.image-tag }}"
```

- Artifacts share files between jobs using `actions/upload-artifact` and `actions/download-artifact`. Use them for build outputs, test reports, or any data too large for outputs (which have a 1 MB limit).
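The artifact mechanism, sketched (artifact name and paths are arbitrary examples):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: mkdir -p dist && echo "built" > dist/app.txt   # Stand-in for a real build
      - uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/
  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      - run: ls dist/   # The files built in the other job are now here
```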
Q14: What is the GITHUB_TOKEN and how does it work?
`GITHUB_TOKEN` is an auto-generated, short-lived token created for every workflow run. Key properties:
- Scoped to the current repository: it cannot access other repos.
- Expires when the job completes: not reusable after the run.
- Permissions are configurable via the `permissions:` block in the workflow YAML.
- Default permissions depend on repository settings and can be "read all" or "read + write" (Settings → Actions → General → Workflow permissions).
How it differs from a PAT (Personal Access Token): PATs are created by a user, can access any repo the user has access to, and must be rotated manually. `GITHUB_TOKEN` is automatic, scoped, and ephemeral, so always prefer it over PATs in workflows.
Q15: Explain GitHub-hosted vs self-hosted runners. When would you choose each?
| Aspect | GitHub-hosted | Self-hosted |
|---|---|---|
| Maintenance | Zero: GitHub manages the VMs | You maintain hardware, OS, tools |
| Environment | Fresh VM per job (clean state) | Persistent (state carries over) |
| Tools | Pre-installed (Node, Python, Docker, etc.) | You install what you need |
| Cost | Pay per minute (free tier for public repos) | Your hardware cost, no per-minute charges |
| Network | Public internet only | Access to internal networks, on-prem resources |
Choose self-hosted when: you need internal network access, custom hardware (GPUs), compliance requirements, or high volume to reduce cost.
Security warning: Never use self-hosted runners on public repositories. A malicious PR could run arbitrary code on your infrastructure.
Q16: What is `workflow_dispatch` and when would you use it?
`workflow_dispatch` enables manual triggering of a workflow from the GitHub UI or API, with optional custom inputs.
```yaml
on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Target environment'
        required: true
        type: choice
        options: [staging, production]
      version:
        description: 'Image tag to deploy'
        required: true
        type: string
```
Use cases: on-demand deployments, hotfix releases, maintenance tasks (database backups, cache purges), testing workflows during development.
Q17: How does matrix strategy work? Give an example.
Matrix strategy runs the same job across multiple configurations in parallel. It creates a Cartesian product of the matrix dimensions.
```yaml
strategy:
  matrix:
    node-version: [18, 20, 22]
    os: [ubuntu-latest, windows-latest]
    # Creates 3 × 2 = 6 parallel jobs
    include:
      - node-version: 22
        os: ubuntu-latest
        coverage: true          # Extra variable for this combo
    exclude:
      - node-version: 18
        os: windows-latest      # Skip this combination
```
Access values in steps via ${{ matrix.node-version }} and ${{ matrix.os }}. Use include to add specific combos or extra variables, and exclude to remove unwanted combos.
Q18: What are reusable workflows vs composite actions? When do you use each?
| Feature | Reusable Workflow | Composite Action |
|---|---|---|
| Level | Workflow-level (has jobs) | Step-level (runs in calling job) |
| Trigger | `workflow_call` | `uses:` in a step |
| Secrets | Passed explicitly or inherit | Via inputs / env |
| Environments | Can reference environments | Cannot |
| Best for | Standardized pipelines across repos | Shared step sequences (lint + test) |
Rule of thumb: if you need full jobs with environments and secrets, use a reusable workflow. If you need a few steps bundled together, use a composite action.
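A minimal composite action as a sketch; the file would live at a path like `.github/actions/lint-test/action.yml` (path and commands are illustrative):

```yaml
name: Lint and test
description: Shared lint + test step sequence
runs:
  using: composite
  steps:
    - run: npm ci
      shell: bash   # shell: is required for run steps in composite actions
    - run: npm run lint
      shell: bash
    - run: npm test
      shell: bash
```

A calling workflow would then use it as a single step: `- uses: ./.github/actions/lint-test`.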
Q19: How does concurrency work in GitHub Actions?
The `concurrency` key groups workflow runs and controls what happens when multiple runs of the same group are active.
```yaml
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```
How it works: If a new run starts with the same concurrency group, the previous in-progress run is cancelled. This is critical for PR builds β if a developer pushes 3 commits in quick succession, only the latest commit's run proceeds.
Without `cancel-in-progress`, the new run queues until the previous one finishes.
Q20: Explain the `needs` keyword and how job dependencies work.
`needs:` creates a Directed Acyclic Graph (DAG) of job execution order.
```yaml
jobs:
  lint: ...    # Runs immediately
  test: ...    # Runs immediately (parallel with lint)
  build:
    needs: [lint, test]   # Waits for BOTH lint and test
  deploy:
    needs: build          # Waits for build
```
Behavior when a needed job fails: all dependent jobs are skipped. To override this, use if: always() or if: failure() on the dependent job.
Accessing outputs: ${{ needs.build.outputs.image-tag }} β you can only access outputs from jobs listed in your needs:.
Category 3: YAML Pipelines & Syntax
Q21: Walk me through a GitHub Actions YAML file from top to bottom.
```yaml
name: CI Pipeline          # Display name in the Actions tab
on:                        # Trigger events
  push:
    branches: [main]
  pull_request:
    branches: [main]
permissions:               # Token permissions (least privilege)
  contents: read
env:                       # Workflow-level environment variables
  REGISTRY: myacr.azurecr.io
jobs:
  build:                   # Job ID
    runs-on: ubuntu-latest                # Runner selection
    needs: [lint]                         # Job dependency (optional)
    if: github.ref == 'refs/heads/main'   # Condition (optional)
    strategy:                             # Matrix (optional)
      matrix:
        node: [18, 20]
    steps:
      - uses: actions/checkout@v4   # Action step
      - run: npm install            # Shell step
        env:
          NODE_ENV: production      # Step-level env var
```
Order matters for readability, but only `on:` and `jobs:` are required at the top level.
Q22: What are common YAML mistakes that break workflows?
- Tabs instead of spaces: YAML only allows spaces for indentation. A single tab will silently break the workflow.
- Inconsistent indentation: mixing 2-space and 4-space indentation within the same level.
- Unquoted colons in `run:` commands: `run: echo Time: 12:30` breaks because YAML treats the colon-plus-space after `Time` as another key-value separator. Quote the whole command.
- Missing `${{ }}` in expressions: outside `if:`, you must wrap expressions, e.g. `${{ env.MY_VAR }}`. Inside `if:`, expressions are implicit.
- Duplicate keys: two `env:` blocks at the same level silently overwrite each other.
- Wrong file path: workflow files must be in `.github/workflows/` with a `.yml` or `.yaml` extension.
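For example, the unquoted-colon mistake and its fix, as a minimal sketch:

```yaml
# Broken: YAML parses "Time: 12:30" as a nested mapping
# - run: echo Time: 12:30

# Fixed: quote the scalar so the colon is literal text
- run: 'echo "Time: 12:30"'
```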
Q23: How do path filters and branch filters work?
```yaml
on:
  push:
    branches: [main, 'release/**']       # Only these branches
    paths: ['src/**', 'package.json']    # Only when these files change
    # Alternatively (mutually exclusive with paths):
    # paths-ignore: ['docs/**', '*.md']  # Ignore these files
```
Rules:
- `branches` and `paths` can be combined; both must match.
- `paths` and `paths-ignore` cannot be used together.
- Path filters use glob patterns: `*` matches a single level, `**` matches multiple levels.
- If only docs changed and your `paths` filter doesn't include docs, the workflow won't run.
Q24: Explain `if:` conditions with an example.
Conditional execution at the job or step level:
```yaml
# Only deploy from main branch
- run: helm upgrade myapp ./chart
  if: github.ref == 'refs/heads/main'

# Only run on pull requests
- run: echo "This is a PR"
  if: github.event_name == 'pull_request'

# Run even if previous steps failed
- run: echo "Sending failure notification"
  if: failure()

# Always run (cleanup)
- run: rm -rf ./tmp
  if: always()

# Skip CI based on commit message
- run: npm test
  if: "!contains(github.event.head_commit.message, '[skip ci]')"
```
Note: Expressions in if: are automatically wrapped in ${{ }}, so you don't need to write them explicitly.
Q25: What is the ${{ }} expression syntax?
The expression evaluator for GitHub Actions. It can access contexts and call functions:
Contexts: `github`, `env`, `secrets`, `matrix`, `needs`, `steps`, `runner`, `job`, `inputs`
Functions:
- `contains(search, item)`: string or array containment check.
- `startsWith(string, prefix)` / `endsWith(string, suffix)`
- `toJSON(value)` / `fromJSON(value)`: serialize/deserialize (critical for dynamic matrices).
- `hashFiles('**/package-lock.json')`: generates a hash for cache keys.
- `format()`: string formatting.
Key rule: Inside if:, expressions are implicit. Everywhere else, you must use ${{ }}.
Q26: How do outputs work between steps and between jobs?
Between steps (same job):
```yaml
steps:
  - id: version
    run: echo "tag=v1.2.3" >> $GITHUB_OUTPUT
  - run: echo "Version is ${{ steps.version.outputs.tag }}"
```
Between jobs:
```yaml
jobs:
  build:
    outputs:
      tag: ${{ steps.version.outputs.tag }}
    steps:
      - id: version
        run: echo "tag=v1.2.3" >> $GITHUB_OUTPUT
  deploy:
    needs: build
    steps:
      - run: echo "${{ needs.build.outputs.tag }}"
```
Important: `$GITHUB_OUTPUT` replaced the deprecated `set-output` command. Always use the file-based approach.
Q27: What does continue-on-error: true do vs if: always()?
- `continue-on-error: true`: the step/job reports success even if it fails, so the overall job status stays green. Use it for non-critical steps (e.g., optional code quality tools).
- `if: always()`: the step runs regardless of previous step results, but its own result affects the job status normally. Use it for cleanup or notification steps.
Example: A notification step should use if: always() so it runs when things fail. An experimental lint tool should use continue-on-error: true so it doesn't block the pipeline.
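Side by side, as a sketch (the linter and notification script are hypothetical stand-ins):

```yaml
steps:
  - run: npx experimental-linter .     # Hypothetical optional tool
    continue-on-error: true            # Its failure never turns the job red
  - run: npm test
  - run: ./notify.sh "Build finished"  # Hypothetical notification script
    if: always()                       # Runs even if npm test failed;
                                       # its own failure still fails the job
```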
Q28: Explain the timeout-minutes setting.
Default job timeout is 360 minutes (6 hours). This is dangerously high for most workflows.
```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 15      # Job-level timeout
    steps:
      - run: npm test
        timeout-minutes: 10  # Step-level timeout
```
Best practice: always set a timeout. A stuck `npm install` or a hanging test can run for 6 hours, consuming your minutes budget. Set the timeout to 2–3× the expected duration.
Category 4: Security & Secrets
Q29: How are secrets managed in GitHub Actions? Explain the hierarchy.
Three scopes, from broadest to narrowest:
- Organization secrets: shared across selected (or all) repositories in the org.
- Repository secrets: available to all workflows in that repo.
- Environment secrets: only available to jobs that reference that specific environment.
Override order: Environment > Repository > Organization. If the same secret name exists at multiple levels, the most specific scope wins.
Security properties:
- Encrypted at rest using libsodium sealed boxes.
- Automatically masked in logs (the raw value is replaced with `***`).
- Not accessible in fork PRs, which prevents secret exfiltration from malicious forks.
- Never exposed in workflow YAML: accessed only through `${{ secrets.NAME }}`.
Q30: What is OIDC and why is it better than stored secrets for cloud auth?
OIDC (OpenID Connect) enables passwordless authentication between GitHub Actions and cloud providers (Azure, AWS, GCP).
How it works:
- GitHub Actions presents a JWT (JSON Web Token) with claims about the workflow (repo, branch, environment).
- The cloud provider validates the JWT against GitHub's OIDC endpoint.
- If the claims match the trust policy, the provider issues short-lived credentials (valid 15–60 minutes).
Why it's better than stored secrets:
- No secret rotation: there's nothing stored to rotate.
- No leak risk: no long-lived credentials exist anywhere.
- Fine-grained trust: the cloud provider can restrict access by repo, branch, environment, and even workflow.
- Audit trail: every token request is logged with full context.
```yaml
permissions:
  id-token: write   # Required for OIDC
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ secrets.AZURE_CLIENT_ID }}
      tenant-id: ${{ secrets.AZURE_TENANT_ID }}
      subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
```
Trade-off: OIDC is better but requires initial setup in Azure AD (federated credentials). Worth the one-time effort for every production workflow.
Q31: A secret value appears in your workflow logs. How did this happen and how do you fix it?
Common causes:
- Secret used in a URL: `git clone https://user:${{ secrets.TOKEN }}@github.com/...` logs the URL, and masking may not catch the secret as part of a larger string.
- Base64 encoded: masking only hides the original value; the encoded version is a different string.
- Sub-string match failure: very short secrets (e.g., 3 characters) might not get masked reliably.
- Environment variable expansion: some tools echo environment variables as part of their verbose output.
Fixes:
- Always pass secrets via `env:` blocks, never inline in `run:`.
- Use `add-mask` for derived values: `echo "::add-mask::$DERIVED_VALUE"`.
- Suppress output for sensitive commands: redirect to a file or `/dev/null`.
- Audit all steps that reference secrets and check their output behavior.
Q32: How do you handle third-party action security?
Third-party actions run code on your runner with access to your secrets. Best practices:
- Pin by SHA, not tag: `uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29`. Tags can be moved; SHAs are immutable.
- Enable Dependabot for actions: automatically creates PRs when pinned actions have updates.
- Review source code before first use: check what the action does with secrets and network access.
- Limit to trusted publishers: look for the GitHub verified-creator badge.
- Use org-level allow-lists: restrict which actions can be used across all repositories.
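Enabling Dependabot for actions is a one-time file at `.github/dependabot.yml`; a minimal sketch:

```yaml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"        # Scans .github/workflows/
    schedule:
      interval: "weekly"
```

Dependabot then opens PRs that bump pinned SHAs/tags, so SHA pinning stays maintainable.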
Q33: What permissions should a production deployment workflow have?
Follow the principle of least privilege:
```yaml
permissions:
  contents: read    # Read repo code
  id-token: write   # OIDC authentication
  # Everything else: implicitly denied
```
Additionally:
- Environment protection rules: required reviewers for the `production` environment.
- Wait timers: enforce a delay between staging and production deploys.
- Branch restrictions: only allow deployments from `main` or `release/*` branches.
- Deployment branch policy: prevent ad-hoc deployments from feature branches.
Never use `permissions: write-all` in production workflows.
Q34: What happens when a workflow runs from a forked repository?
When a fork opens a PR against your repo:
- `GITHUB_TOKEN` is read-only: it cannot push, create releases, or modify the target repo.
- Secrets are NOT available: `${{ secrets.MY_SECRET }}` resolves to an empty string.
- First-time contributors require maintainer approval before any workflow run (configurable in Settings → Actions → Fork pull request workflows).
Why? This prevents secret exfiltration. A malicious actor could fork your repo, modify the workflow to `echo ${{ secrets.PROD_PASSWORD }}`, and steal credentials. The fork restriction prevents this entirely.
Q35: How do you implement environment-based deployment gates?
1. Create environments in Settings → Environments (e.g., `staging`, `production`).
2. Add protection rules:
   - Required reviewers: 1+ people must approve before the job proceeds.
   - Wait timer: enforce a cooldown (e.g., 30 minutes between staging and production).
   - Branch restrictions: only `main` can deploy to production.
3. Reference the environment in the workflow:
```yaml
jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment: production   # ← This pauses for approval
    steps:
      - run: helm upgrade myapp ./chart --namespace production
```
The job will pause and wait for a reviewer to approve in the GitHub UI before executing any steps.
Q36: How would you audit all secrets usage across an org's workflows?
- GitHub audit log: tracks which workflows accessed which secrets and when.
- Code search: search all workflow files for `secrets.` references: `org:myorg path:.github/workflows "secrets."`
- Org-level secret policies: centrally manage which repos can access which secrets.
- Migrate to OIDC: reduces the number of stored secrets that need auditing.
- Automate with a workflow: create a scheduled workflow that scans all repos' workflow files and reports secret usage.
Category 5: Real-World Deployment Scenarios
Q37 ★★: Design a CI/CD pipeline to deploy a containerized app to AKS using GitHub Actions and Helm.
Full pipeline architecture:
```text
┌──────┐   ┌──────┐   ┌────────────┐   ┌──────────┐
│ Lint │──►│ Test │──►│ Build+Push │──►│ Deploy   │
└──────┘   └──────┘   │ Docker→ACR │   │ Staging  │
                      └────────────┘   └────┬─────┘
                                            │
                      ┌──────────┐   ┌──────▼───┐
                      │ Approval │◄──│  Smoke   │
                      │   Gate   │   │   Test   │
                      └────┬─────┘   └──────────┘
                           │
                    ┌──────▼─────┐
                    │  Deploy    │
                    │ Production │
                    └────────────┘
```
```yaml
name: Deploy to AKS
on:
  push:
    branches: [main]
permissions:
  id-token: write
  contents: read
env:
  ACR_NAME: myacr
  IMAGE: myacr.azurecr.io/myapp
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
  build-push:
    needs: [lint, test]
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az acr login --name $ACR_NAME
      - id: meta
        run: echo "tag=sha-$(git rev-parse --short HEAD)" >> $GITHUB_OUTPUT
      - run: |
          docker build -t $IMAGE:${{ steps.meta.outputs.tag }} .
          docker push $IMAGE:${{ steps.meta.outputs.tag }}
  deploy-staging:
    needs: build-push
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az aks get-credentials --resource-group myRG --name myAKS
      - run: |
          helm upgrade --install myapp ./chart \
            --namespace staging \
            --set image.tag=${{ needs.build-push.outputs.image-tag }} \
            --atomic --wait --timeout 5m
  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production   # Pauses for approval
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az aks get-credentials --resource-group myRG --name myAKS
      - run: |
          helm upgrade --install myapp ./chart \
            --namespace production \
            --set image.tag=${{ needs.build-push.outputs.image-tag }} \
            --atomic --wait --timeout 5m
```
Key decisions: OIDC for Azure auth (no stored credentials), `--atomic --wait` for safe deploys (automatic rollback on failure), commit-SHA image tags for traceability, and environments for approval gates.
Q38: Your production deployment via Helm fails with a timeout. Walk me through your debugging process.
Step-by-step diagnosis:
1. Check GitHub Actions logs: identify which step failed. Is it the `helm upgrade` step? What does the error say?
2. Check pod status:

```bash
kubectl get pods -n production
# Look for: CrashLoopBackOff, ImagePullBackOff, Pending
```

3. Describe the failing pod:

```bash
kubectl describe pod <pod-name> -n production
# Check the Events section: image pull errors, resource limits, scheduling failures
```

4. Check application logs:

```bash
kubectl logs <pod-name> -n production
kubectl logs <pod-name> -n production --previous   # Previous crash
```

5. Verify the image exists in ACR: `az acr repository show-tags --name myacr --repository myapp`
6. Check resource quotas: `kubectl describe resourcequota -n production`
7. Check readiness/liveness probes: misconfigured probes are the #1 cause of Helm timeouts.
8. Roll back if needed: `helm rollback myapp -n production`
Root cause hierarchy: 80% of Helm timeouts are caused by: 1) bad readiness probes, 2) wrong image tag, 3) insufficient resources, 4) missing secrets/configmaps.
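Since bad readiness probes top that list, it helps to know what a sane one looks like. A sketch of a pod-spec fragment (path, port, and timings are illustrative):

```yaml
# In the Deployment's pod spec
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5   # Too-short delays make slow-starting pods flap NotReady
  periodSeconds: 10
  failureThreshold: 3      # Marked NotReady after 3 consecutive failures
```

If pods never become Ready, `helm upgrade --wait` sits until its timeout and then fails.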
Q39: How would you implement a canary deployment in GitHub Actions?
1. Deploy the canary as a separate Helm release with 1 replica and a low traffic weight:

```bash
helm upgrade --install myapp-canary ./chart \
  --set replicaCount=1 \
  --set ingress.canaryWeight=10 \
  --namespace production
```

2. Run health checks for N minutes: monitor error rates and latency via Prometheus/Grafana or an API health endpoint.
3. If healthy, promote: update the main release with the new image and remove the canary.

```bash
helm upgrade myapp ./chart --set image.tag=$NEW_TAG
helm uninstall myapp-canary
```

4. If unhealthy, roll back: remove the canary; the main release is untouched.

```bash
helm uninstall myapp-canary
```
For advanced canary releases, use Flagger with Istio or NGINX; it automates the traffic shifting and analysis.
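A Flagger `Canary` resource automates the weight stepping; a hedged sketch (field values are illustrative, and the exact schema should be checked against the Flagger docs):

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    interval: 1m     # Evaluate metrics every minute
    threshold: 5     # Roll back after 5 failed checks
    maxWeight: 50    # Never send more than 50% to the canary
    stepWeight: 10   # Increase traffic in 10% steps
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99    # Require ≥99% success rate
        interval: 1m
```

With this in place, the pipeline only pushes a new image; Flagger handles promotion and rollback.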
Q40: How do you handle rollback in a Helm-based CI/CD pipeline?
Automatic rollback: use the `--atomic` flag with `helm upgrade`. If the deployment fails (pods don't become ready within the timeout), Helm automatically rolls back to the previous release.
```bash
helm upgrade --install myapp ./chart --atomic --wait --timeout 5m
```
Manual rollback:
```bash
helm history myapp -n production      # List revisions
helm rollback myapp 3 -n production   # Roll back to revision 3
```
CI-triggered rollback: create a separate `workflow_dispatch` workflow that accepts a revision number and runs `helm rollback`. Or simply re-run a previous successful workflow; since images are tagged with the commit SHA, it deploys the known-good version.
Best practice: Always test your rollback procedure. A rollback you've never tested is a rollback that will fail when you need it most.
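A rollback workflow along those lines might look like this sketch (resource-group and cluster names are placeholders reused from the Q37 example):

```yaml
name: Rollback
on:
  workflow_dispatch:
    inputs:
      revision:
        description: 'Helm revision to roll back to (see helm history)'
        required: true
        type: string
jobs:
  rollback:
    runs-on: ubuntu-latest
    environment: production   # Reuse the same approval gate as deploys
    steps:
      - uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
      - run: az aks get-credentials --resource-group myRG --name myAKS
      - run: helm rollback myapp ${{ github.event.inputs.revision }} -n production
```

Running it in a drill against staging is one way to act on the "always test your rollback" advice.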
Q41: A workflow takes 15 minutes. How would you optimize it?
Optimization strategies (ordered by impact):
1. Parallelize independent jobs: lint, test, and security scan can run in parallel instead of sequentially.
2. Add dependency caching:

```yaml
- uses: actions/setup-node@v4
  with:
    node-version: 20
    cache: 'npm'   # Built-in caching
```

3. Docker layer caching: use `cache-from: type=gha` with `docker/build-push-action`.
4. Reduce test scope for PRs: only run tests for changed modules using path filters or a test-impact-analysis tool.
5. Use larger runners: `ubuntu-latest-4-cores` (GitHub-hosted) for compute-heavy jobs.
6. Skip unnecessary steps: use path filters to skip CI for docs-only changes.
7. Cancel stale runs: use `concurrency` with `cancel-in-progress: true`.
Q42: How do you manage CI/CD for a monorepo with 10+ microservices?
Key strategies:
- Path filters per service: each workflow triggers only when its service's files change.
- Dynamic matrix to detect changed services and build only those:

```yaml
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      api: 'services/api/**'
      web: 'services/web/**'
      auth: 'services/auth/**'
# Later: build a dynamic matrix from the changed services
- id: matrix
  run: |
    echo "services=$(echo '${{ toJSON(steps.changes.outputs) }}' | jq -c '[to_entries[] | select(.value=="true") | .key]')" >> $GITHUB_OUTPUT
```

- Reusable workflows for standardized CI: every service calls the same build-and-deploy workflow.
- Per-service Helm charts: each service has its own chart in `services/<name>/chart/`.
- Parallel deploys: use a matrix job to deploy all changed services simultaneously.
Q43: How would you implement a multi-region deployment?
Strategy:
- Deploy to primary region first.
- Run smoke tests against the primary region.
- Deploy to secondary regions in parallel using a matrix strategy:
```yaml
deploy-secondary:
  needs: deploy-primary
  strategy:
    matrix:
      region: [westeurope, southeastasia, brazilsouth]
      include:
        - region: westeurope
          cluster: aks-eu
          rg: rg-eu
        - region: southeastasia
          cluster: aks-asia
          rg: rg-asia
        - region: brazilsouth
          cluster: aks-brazil
          rg: rg-brazil
```

- Use environment-per-region with individual approval gates if needed.
Considerations: DNS failover (Azure Traffic Manager / Front Door), database replication lag (deploy DB migrations to primary first, wait for replication), and regional configuration differences (CDN endpoints, compliance settings).
Q44 — Your colleague's workflow passes in their fork but fails in the main repo. What are the possible causes?
Common causes:
- Secrets differ between fork and upstream: workflow runs in the fork use the fork's own secrets, and PRs from forks don't receive the upstream repo's secrets — a step that needs a missing secret silently gets an empty string. The run can "pass" in one context and fail in the other.
- Different branch protection rules: the main repo may require status checks or CODEOWNERS review that the fork doesn't.
- Org policy restrictions: the main repo's organization may restrict which actions are allowed.
- Environment protection rules: environments with required reviewers exist in the main repo but not the fork.
- Different runner availability: the main repo might use org-specific self-hosted runners that the fork doesn't have access to.
- CODEOWNERS restrictions: changes to certain paths require review from specific code owners in the main repo.
Debugging approach: Compare the workflow runs — check exactly which step fails and what the error message says. Most often it's a secrets or permissions issue.
🎯 Category 6 — Scenario-Based Deep Dives
These are long-form whiteboard-style questions. Interviewers expect structured, phased answers — not a single sentence.
Q45 ★ — You are hired as the first DevOps engineer at a startup with 5 developers. They deploy manually via SSH. Design their CI/CD strategy using GitHub Actions.
Phased approach (don't try to do everything at once):
| Phase | Timeline | Goal |
|---|---|---|
| Phase 1 | Week 1 | Add basic CI β lint + unit tests on every PR. This gives immediate feedback and builds trust in automation. |
| Phase 2 | Week 2 | Containerize the app β create a Dockerfile, add a Docker build step to CI. Ensure every build produces an immutable image. |
| Phase 3 | Week 3 | Set up AKS (or ECS), create Helm charts, add automated deployment to a staging environment. Developers start seeing their code running in a real cluster. |
| Phase 4 | Week 4 | Add production environment with approval gates. Migrate from SSH deployments to pipeline-driven deployments. Remove SSH access to production. |
| Phase 5 | Ongoing | Add monitoring and alerting (Prometheus + Grafana), add rollback automation, optimize pipeline speed, add security scanning. |
Key principles:
- Start small and iterate — don't build the entire pipeline in week 1.
- Make it the default — remove the old SSH workflow completely once the pipeline is stable.
- Get developer buy-in — show them the pipeline saves time, don't force it.
- Document everything — the team needs to understand and maintain the pipeline without you.
Q46: Your organization has 200 repositories. How do you standardize CI/CD?
- Central `.github` repository — contains reusable workflows that all repos call via `workflow_call`. Update once, all 200 repos benefit.
- Org-level secrets and variables — shared credentials (registry URL, OIDC config) managed centrally.
- Custom actions for common patterns — create composite actions for standard tasks (lint, build Docker, deploy Helm).
- Required status checks — branch protection rules enforce that specific workflows must pass before merging.
- Org-wide runner groups — self-hosted runners shared across repos with proper access controls.
- Dependabot for action updates — a single Dependabot config template keeps all repos' action versions current.
- Internal documentation — a "How to CI/CD at [Company]" guide with templates and examples.
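With a central `.github` repository in place, each consumer repo needs only a thin wrapper. A sketch (the org name, workflow path, and inputs are assumptions):

```yaml
# .github/workflows/ci.yml in each of the 200 repos
name: CI
on: [push, pull_request]

jobs:
  standard-ci:
    uses: my-org/.github/.github/workflows/build-and-deploy.yml@main
    with:
      service-name: api
    secrets: inherit   # pass repo/org secrets through to the reusable workflow
```

Pinning `@main` keeps every repo on the latest standard automatically; teams wanting stability can pin a tag or SHA instead, at the cost of manual upgrades.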
Q47 — A deployment to production succeeded in GitHub Actions but users report the app is down. Walk through your incident response.
Incident response sequence:
- Verify the pipeline — check GitHub Actions logs. All steps green? Build, push, and deploy all succeeded?
- Check pod status:

```bash
kubectl get pods -n production   # Are pods Running? Ready? Restarting?
```

- Check application logs: `kubectl logs <pod> -n production` — look for runtime errors, connection failures, OOM kills.
- Check ingress / DNS: Is the ingress controller healthy? Did the DNS update propagate? Can you curl the health endpoint directly from inside the cluster?
- Check health endpoints: `curl -v https://myapp.com/healthz`
- Compare Helm release values with expected: `helm get values myapp -n production`
- Rollback immediately: `helm rollback myapp -n production` — restore service first, investigate later.
- Post-mortem: Add smoke tests to the pipeline (they would have caught this), add readiness probe checks, add synthetic monitoring.
Golden rule: Restore service first, investigate second. A 5-minute rollback is better than 30 minutes of debugging while users are impacted.
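The smoke test mentioned in the post-mortem can be a single step at the end of the deploy job. A sketch, assuming the app exposes a `/healthz` endpoint (the URL is a placeholder):

```yaml
- name: Smoke test
  run: |
    # Retry for up to ~2 minutes, then fail the workflow so the team
    # knows the deploy is bad before users do
    for i in $(seq 1 12); do
      if curl -fsS https://myapp.example.com/healthz; then exit 0; fi
      sleep 10
    done
    echo "Smoke test failed: app not healthy after deploy"
    exit 1
```

A failing smoke test turns "users report the app is down" into "the pipeline reports the deploy is bad", which is exactly the signal a rollback step can act on.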
Q48: How would you implement feature flags with CI/CD?
Core principle: Decouple deployment from release.
- Deployment = code is on production servers (handled by CI/CD).
- Release = feature is visible to users (handled by feature flags).
Implementation:
- Use a feature flag service — LaunchDarkly, Unleash (open source), or a simple config-based system.
- Deploy code with flags off — the feature is in production but invisible to users.
- Enable gradually via the flag service — 5% → 25% → 50% → 100%.
- If issues arise, disable the flag instantly — no redeployment needed.
CI/CD integration: The pipeline deploys code. The flag service controls behavior. They are independent systems. Never use CI/CD to toggle feature flags — that defeats the purpose of instant rollback.
Q49: Explain how you'd set up CI/CD for a regulated industry (healthcare/finance).
Compliance-focused pipeline requirements:
- Signed commits — require GPG or SSH commit signatures via branch protection rules.
- Required reviewers (2+) — enforce the four-eyes principle for all production changes.
- SOC2-compliant audit trail — GitHub's audit log captures every workflow run, every secret access, every deployment event.
- OIDC only — no stored secrets for cloud authentication. Eliminates the credential rotation compliance burden.
- SHA-pinned actions — immutable action references prevent supply chain attacks.
- SLSA provenance — generate build provenance attestations (`actions/attest-build-provenance`) to prove what code produced which artifact.
- Environment wait timers — mandatory cooling periods between staging and production.
- Separate environments — staging → pre-production → production, each with its own approval chain.
- Automated compliance scanning β add SAST (CodeQL), dependency scanning (Dependabot), container scanning (Trivy) as required pipeline stages.
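Two of these controls show up directly in workflow YAML. A sketch (the `<full-commit-sha>` values are placeholders: pin to the real, audited SHA behind the tag you trust):

```yaml
permissions:
  id-token: write    # lets the job request an OIDC token
  contents: read

steps:
  # SHA-pinned action: an immutable reference instead of a mutable tag
  - uses: actions/checkout@<full-commit-sha>   # the audited SHA behind v4
  # OIDC login: federated identity, no long-lived cloud credential stored
  - uses: azure/login@<full-commit-sha>
    with:
      client-id: ${{ vars.AZURE_CLIENT_ID }}
      tenant-id: ${{ vars.AZURE_TENANT_ID }}
      subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
```

Note that client/tenant/subscription IDs are identifiers, not secrets, so storing them as variables rather than secrets is acceptable; the trust comes from the federated credential configuration in Azure AD.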
Q50: Your CI pipeline costs $5,000/month. How do you reduce it?
Cost reduction strategies (ordered by impact):
- Analyze runner minutes usage — find the top 10 workflows by minutes consumed. Focus optimization efforts there.
- Add path filters — skip CI entirely for docs-only, README, or config-only changes.
- Enable caching — `npm ci` with cache saves 1–3 minutes per run. Docker layer caching saves 5+ minutes.
- Use `concurrency` — cancel stale runs. If a developer pushes 5 commits, only the last one needs to complete.
- Reduce matrix scope for PRs — run the full matrix (all OS × all versions) only on `main`. For PRs, test on the most common configuration only.
- Consider self-hosted runners — for high-volume repos (100+ runs/day), the break-even point is often reached within a month.
- Use spot instances for non-critical jobs — self-hosted runners on spot/preemptible VMs cost 60–90% less.
- Right-size runner specs — not every job needs a 4-core runner. Use the default 2-core for simple tasks.
Quick win: Adding `concurrency` with `cancel-in-progress: true` and path filters alone often reduces costs by 30–40%.
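The path-filter quick win is a few lines on the trigger:

```yaml
on:
  push:
    paths-ignore:
      - '**.md'
      - 'docs/**'
  pull_request:
    paths-ignore:
      - '**.md'
      - 'docs/**'
```

One caveat worth mentioning in interviews: if a path-filtered workflow is also a required status check, PRs that skip it may be left waiting on that check, so filter at the job level (or handle the skip) for required workflows.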
📌 Tips for Interviews
- Always mention trade-offs — e.g., "OIDC is better for security but requires Azure AD federated credential setup."
- Use specific tool names — say `actions/checkout@v4`, `helm upgrade --install --atomic`, `dorny/paths-filter` instead of vague descriptions.
- Draw pipeline diagrams when explaining architecture — interviewers love visual thinkers.
- Share real examples from your experience — "In my last project, we reduced pipeline time from 12 minutes to 4 by…"
- Know the differences: GitHub Actions vs Azure DevOps vs Jenkins vs GitLab CI — each has trade-offs. Be prepared to compare.
🏁 Summary & Course Completion
You've completed the GitHub Actions — Zero to Hero course. You've covered CI/CD fundamentals, YAML pipelines, secrets management, reusable workflows, AKS deployments with Helm, security hardening, debugging, and interview preparation.
You now have the knowledge to design, build, secure, and troubleshoot production-grade CI/CD pipelines. Go build something great — and ace that interview. 🚀