Hands-on Lesson 13 of 14

Debugging Workflows

Master the art of diagnosing and fixing failed GitHub Actions workflows — from YAML syntax errors to production deployment failures.

🧒 Simple Explanation (ELI5)

Imagine your car breaks down on the highway. You don't replace the entire engine — that would be insane. Instead, you follow a diagnostic process: check the warning lights, read the error codes, and trace them to the failing part.

A junior developer sees a red ❌ and panics. A senior developer sees a red ❌ and thinks: "Great, the system is telling me exactly where to look." This page teaches you that systematic diagnostic process so you never panic at a failed workflow again.

🛠️ Debugging Toolkit

Before diving into specific scenarios, arm yourself with these essential tools. Every GitHub Actions developer should know these exist.

1. Enable Debug Logging (Step-Level)

Go to Settings → Secrets and variables → Actions → New repository secret and create:

text
Name:  ACTIONS_STEP_DEBUG
Value: true

This enables verbose output for every step — you'll see every command being run, environment variable resolution, and internal action details. The logs grow 5–10× larger but contain the exact line that failed.

2. Enable Runner Diagnostic Logging

text
Name:  ACTIONS_RUNNER_DEBUG
Value: true

This shows runner-level diagnostics — job setup, Docker layer pulls, cache resolution, and internal runner operations. Useful when the issue isn't in your code but in the runner environment itself.

3. Re-run with Debug Logging (No Secrets Needed)

On any failed workflow run, click "Re-run all jobs" → check the "Enable debug logging" checkbox. This is the quickest way to enable debug logs without creating secrets. The logs apply only to that single re-run.

4. Download Full Logs

On any workflow run page, click the gear icon ⚙️ → "Download log archive". You get a ZIP file containing the raw log for every job and step — perfect for searching with grep or sharing with teammates.

5. act — Run Actions Locally

Install nektos/act to run workflows on your local machine before pushing:

bash
# Install (macOS/Linux)
brew install act

# Run the default push event
act

# Run a specific job
act -j build

# Run with secrets from a .secrets file
act --secret-file .secrets
⚠️
Warning

act doesn't perfectly replicate GitHub-hosted runners. It uses Docker images that approximate the runner environment. Some actions (especially those using OIDC or runner-specific features) won't work locally. Use it for YAML validation and basic logic testing, not as a 1:1 replacement.
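For the --secret-file flag shown above, act reads a dotenv-style file of KEY=value pairs. A minimal sketch (the names mirror secrets used later in this lesson; all values are placeholders):

```text
# .secrets (dotenv format read by act's --secret-file flag)
ACR_PASSWORD=dummy-password
KUBE_CONFIG=dummy-kubeconfig
GITHUB_TOKEN=ghp_placeholder
```

Keep this file out of version control (add it to .gitignore) since real values would be credentials.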

6. VS Code Extension

Install the GitHub Actions extension (github.vscode-github-actions) for live YAML validation, auto-complete for action inputs, and the ability to trigger runs directly from VS Code.

7. workflow_dispatch for Quick Iteration

During development, add a manual trigger so you can test without pushing dummy commits:

yaml
on:
  workflow_dispatch:       # Manual trigger for testing
    inputs:
      debug:
        description: 'Enable debug mode'
        required: false
        type: boolean
        default: false
  push:
    branches: [main]

8. Debug Context Step

Add this step to any workflow — it prints the most useful context values (and dumps the full environment via env) so you can see exactly what GitHub Actions knows about the current run:

yaml
- name: Debug context
  if: runner.debug == '1'
  run: |
    echo "Event: ${{ github.event_name }}"
    echo "Ref: ${{ github.ref }}"
    echo "SHA: ${{ github.sha }}"
    echo "Actor: ${{ github.actor }}"
    echo "Workspace: ${{ github.workspace }}"
    echo "Runner OS: ${{ runner.os }}"
    echo "Runner Arch: ${{ runner.arch }}"
    env                                  # dump the full environment as well
💡
Tip

The if: runner.debug == '1' condition means this step only runs when debug logging is enabled. You can leave it in your production workflows permanently — it costs nothing during normal runs.

πŸ› Scenario 1 β€” YAML Syntax Errors

Symptom: You see "Invalid workflow file" in the GitHub UI. The workflow shows an error badge, or worse β€” it doesn't appear in the Actions tab at all. No run is triggered.

Break-and-Fix Lab

The following workflow has five deliberate errors. Read through it carefully and try to spot all of them before scrolling to the fix:

yaml
# ❌ BROKEN — Can you spot ALL 5 errors?
name: CI
on:
  push:
	  branches: [main]           # Error 1: TAB character used for indentation
  pull_request
    types: [opened]              # Error 2: Missing colon after pull_request

jobs:
  build:
    runs-on: ubuntu              # Error 3: Invalid runner label (should be ubuntu-latest)
    steps:
      - uses: actions/checkout@v4
      - run: echo "status": ready    # Error 4: Colon in unquoted string value
      - name: Set output
        run: echo "value=test" >> $GITHUB_OUTPUT
      - name: Use output
        run: echo ${{ steps.set-output.outputs.value }}  # Error 5: Step has no id

Now here's the fixed version with every correction annotated:

yaml
# ✅ FIXED — All 5 errors corrected
name: CI
on:
  push:
    branches: [main]             # Fix 1: Use SPACES (2-space indent), never tabs
  pull_request:                  # Fix 2: Added the missing colon
    types: [opened]

jobs:
  build:
    runs-on: ubuntu-latest       # Fix 3: Use full label "ubuntu-latest"
    steps:
      - uses: actions/checkout@v4
      - run: echo "status ready"     # Fix 4: Removed the problematic colon, or quote the whole string
      - name: Set output
        id: set-output               # Fix 5: Added the id so it can be referenced
        run: echo "value=test" >> $GITHUB_OUTPUT
      - name: Use output
        run: echo ${{ steps.set-output.outputs.value }}

5 Common YAML Pitfalls

| Pitfall | ❌ Broken | ✅ Fixed |
| --- | --- | --- |
| Tabs vs spaces | ⇥branches: [main] (tab indent) | branches: [main] (2 spaces) |
| Missing colons | pull_request | pull_request: |
| Special chars in values | run: echo "a": b | run: echo "a:b" or run: 'echo "a": b' |
| Boolean gotcha | bare on parsed as the boolean true by strict YAML tools | Quote the key ("on":) when an external tool complains; GitHub itself accepts bare on |
| Multiline strings | run: line1\nline2 | Use a literal block scalar: run: followed by a pipe character, then indented lines |
💡
Pro Tip

Validate your YAML locally before pushing: yq eval '.on' .github/workflows/ci.yml. If it errors, you have a syntax problem. VS Code also underlines YAML errors in real time with the GitHub Actions extension installed.
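Alongside yq, a quick local check for the #1 pitfall is grepping workflow files for TAB indentation. A minimal sketch, demonstrated on a throwaway file so it runs anywhere; in a real repo, point the grep at .github/workflows/ instead:

```shell
# Flag TAB-indented lines, which break GitHub's YAML parsing.
dir=$(mktemp -d)
printf 'on:\n\tpush:\n' > "$dir/ci.yml"   # deliberately tab-indented demo file
tab=$(printf '\t')
if grep -rn "^${tab}" "$dir" >/dev/null; then
  result="Tabs found - replace with spaces"
else
  result="No tabs"
fi
echo "$result"
rm -rf "$dir"
```

Wiring this into a pre-commit hook catches the error before GitHub silently ignores the file.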

πŸ› Scenario 2 β€” Workflow Never Triggers

Symptom: You push code, but no workflow run appears in the Actions tab. No error, no badge β€” nothing happens.

This is one of the most frustrating problems because there's no error message to debug. Work through this checklist top-to-bottom:

| Check | Root Cause | Fix |
| --- | --- | --- |
| File path | Workflow not in .github/workflows/ | Move the file to the exact path .github/workflows/ci.yml |
| File extension | Using .txt or no extension | Rename to .yml or .yaml — only these are recognized |
| Branch filter | branches: [main] but pushing to develop | Add the branch or remove the filter: branches: [main, develop] |
| Path filter | paths: ['src/**'] but you only changed README.md | Adjust the path filter or use paths-ignore instead |
| Workflow disabled | Actions disabled in repo settings, or workflow individually paused | Go to the Actions tab → click the disabled workflow → "Enable workflow" |
| Fork PR restrictions | First-time contributor from a fork; Actions requires approval | Go to the PR → click "Approve and run" for first-time contributors |
| YAML parse error | Silent failure — GitHub can't parse the YAML so it ignores the file | Validate with yq or the VS Code GitHub Actions extension |
| Repo is a fork | Actions are disabled by default on forks | Go to the fork's Actions tab → click "I understand, enable Actions" |
| Event type mismatch | on: pull_request but you pushed directly (no PR) | Add a push trigger or open a PR |
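As a concrete example of the branch- and path-filter rows above, a trigger that fires on pushes to either branch while skipping docs-only changes could look like this sketch (branch names are illustrative):

```yaml
on:
  push:
    branches: [main, develop]      # list every branch that should trigger CI
    paths-ignore:
      - '**.md'                    # docs-only changes won't start a run
      - 'docs/**'
```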
🔑
Important

The most common reason workflows silently don't trigger: the workflow file itself has a YAML syntax error. GitHub won't show an error in the Actions tab — it simply won't recognize the file. Always validate locally first.

πŸ› Scenario 3 β€” Permission Denied / 403 Errors

Symptom: You see one of these error messages:

text
Error: Resource not accessible by integration
HttpError: 403 Forbidden
Error: denied: requested access to the resource is denied
Error: The token provided does not have the required permissions

Every one of these means the GITHUB_TOKEN (or your custom token) doesn't have permission to do what your workflow is trying to do.

Common Causes and Fixes

Case 1: Missing permissions Block

Since 2023, new repositories default to read-only GITHUB_TOKEN permissions. If your workflow writes anything (PRs, packages, deployments), you must declare permissions explicitly:

yaml
# ❌ No permissions declared — defaults to read-only
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: gh pr comment --body "Deployed!" # Fails: 403

# ✅ Explicit permissions
permissions:
  pull-requests: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: gh pr comment --body "Deployed!" # Works

Case 2: Pushing Docker Images Without packages: write

yaml
permissions:
  contents: read
  packages: write    # Required for pushing to GHCR

Case 3: OIDC Token Request Without id-token: write

yaml
permissions:
  contents: read
  id-token: write    # Required for Azure/AWS/GCP OIDC login

Case 4: Fork PR with Restricted Token

Pull requests from forks get a read-only token by default — even if your workflow declares write permissions. This is a security feature, and you cannot override it from the workflow file.

⚠️
Security Note

Never use pull_request_target with actions/checkout@v4 pointing to the fork's code to "work around" the fork permission restriction. This is a critical security vulnerability — the fork's code runs with write access to your repository. Use pull_request (safe, read-only) and handle writes in a separate workflow triggered by a comment or label.
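One hedged sketch of that split: keep normal CI on pull_request, and gate the privileged work behind a maintainer-applied label. The label name and step body here are assumptions, not a prescribed pattern:

```yaml
on:
  pull_request_target:
    types: [labeled]

permissions:
  pull-requests: write

jobs:
  privileged:
    # Runs only after a maintainer applies the label, and never checks out fork code
    if: github.event.label.name == 'safe-to-test'
    runs-on: ubuntu-latest
    steps:
      - run: echo "privileged work here (comment on the PR, update labels, etc.)"
```

The key property: this workflow executes only your base branch's code, so the fork never controls what runs with the write token.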

Case 5: Organization Policy Restriction

Org admins can restrict which permissions Actions workflows can request. If your workflow needs packages: write but the org policy caps it at read, you'll get a 403. Fix: ask your org admin to update the Actions permissions policy under Organization Settings → Actions → General → Workflow permissions.

Full permissions Reference

yaml
# All available permissions (set individually as needed)
permissions:
  actions: read|write|none
  checks: read|write|none
  contents: read|write|none
  deployments: read|write|none
  id-token: write|none
  issues: read|write|none
  packages: read|write|none
  pages: read|write|none
  pull-requests: read|write|none
  repository-projects: read|write|none
  security-events: read|write|none
  statuses: read|write|none

πŸ› Scenario 4 β€” Secrets Issues

Symptom: Secret is empty, secret appears in logs, or secret comes from the wrong scope.

πŸ”‘
Critical Behavior

When you reference a secret that doesn't exist β€” ${{ secrets.TYPO }} β€” GitHub Actions returns an empty string. It does NOT throw an error. This is by far the #1 cause of "my secret isn't working" issues.

Case 1: Secret Name Typo

yaml
# ❌ Typo — returns empty string, NO error
- run: echo ${{ secrets.ACR_PASWORD }}

# ✅ Correct name
- run: echo ${{ secrets.ACR_PASSWORD }}

Case 2: Secret Not Available in Fork PRs

This is by design. Pull requests from forks cannot access repository secrets — this prevents malicious forks from stealing your credentials. The secret resolves to an empty string.

Case 3: Secret Exposed in URL

yaml
# ❌ DANGEROUS — URL is logged, secret is visible!
- run: git clone https://user:${{ secrets.TOKEN }}@github.com/org/repo.git

# ✅ Safe — use environment variable, mask is preserved
- run: git clone https://user:${TOKEN}@github.com/org/repo.git
  env:
    TOKEN: ${{ secrets.TOKEN }}

When a secret is interpolated directly into a run: command, the value is injected before the shell sees it. If that value appears in a URL and the URL gets logged (e.g., by git), the masking system may not catch it because the value appears as part of a larger string.

Case 4: Secret Exposed via Base64 / Transformation

yaml
# ❌ Encoded value is logged — masking only hides the original
- run: echo ${{ secrets.KEY }} | base64

# ✅ If you must encode, pass via env and suppress the output
- run: printf '%s' "$KEY" | base64 > encoded.txt
  env:
    KEY: ${{ secrets.KEY }}

Case 5: Environment Secret Not Loaded

yaml
# ❌ Environment secrets require the "environment" key
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - run: echo ${{ secrets.PROD_KEY }}    # Empty!

# ✅ Declare the environment
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production                   # Now PROD_KEY is available
    steps:
      - run: echo ${{ secrets.PROD_KEY }}

Case 6: Organization Secret Not Shared

Organization-level secrets must be explicitly shared with specific repositories. Go to Organization Settings → Secrets and variables → Actions → click the secret → check that your repo is in the "Repository access" list.

Secret Debugging Workflow

Use this reusable step to safely verify whether secrets are set without leaking them:

yaml
- name: Verify secrets are set
  run: |
    errors=0
    if [ -z "$ACR_PASSWORD" ]; then
      echo "::error::ACR_PASSWORD secret is not set!"
      errors=$((errors + 1))
    else
      echo "✅ ACR_PASSWORD is set (length: ${#ACR_PASSWORD})"
    fi
    if [ -z "$KUBE_CONFIG" ]; then
      echo "::error::KUBE_CONFIG secret is not set!"
      errors=$((errors + 1))
    else
      echo "✅ KUBE_CONFIG is set (length: ${#KUBE_CONFIG})"
    fi
    if [ $errors -gt 0 ]; then
      echo "::error::$errors secret(s) missing. Check repository settings."
      exit 1
    fi
  env:
    ACR_PASSWORD: ${{ secrets.ACR_PASSWORD }}
    KUBE_CONFIG: ${{ secrets.KUBE_CONFIG }}
💡
Why pass secrets through env:?

Passing secrets via env: (instead of inline ${{ secrets.X }}) ensures the masking engine always recognizes the value. It also avoids shell injection — a malicious secret value containing ; rm -rf / would be treated as a literal string in an environment variable, not as a shell command.
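The injection point can be illustrated entirely locally. In this self-contained sketch, PAYLOAD stands in for a hostile secret value; exported as an environment variable and expanded inside quotes, it stays one literal string instead of being spliced into the script as shell syntax:

```shell
# PAYLOAD plays the role of a malicious secret value
PAYLOAD='hello; echo INJECTED'
export SECRET="$PAYLOAD"
out=$(echo "$SECRET")         # quoted expansion: the ; is just a character
echo "$out"
```

Inline interpolation is the opposite case: the value is pasted into the script text before the shell parses it, so the ; would start a second command.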

πŸ› Scenario 5 β€” Docker Build / Push Failures

Symptom: "ERROR: failed to solve", "denied: requested access to the resource is denied", or the build takes 30+ minutes.

Case 1: Dockerfile Not Found

text
ERROR: failed to solve: failed to read dockerfile: open Dockerfile: no such file or directory

Cause: The context or file parameter in your Docker build action points to the wrong path.

yaml
# ❌ Dockerfile is in ./app/ but context is root
- uses: docker/build-push-action@v5
  with:
    context: .
    file: ./Dockerfile       # File not in root!

# ✅ Correct path
- uses: docker/build-push-action@v5
  with:
    context: ./app
    file: ./app/Dockerfile

Case 2: Build Fails in CI but Works Locally

Three common causes:

• Files that exist locally but not in the build context (untracked in git, or excluded by .dockerignore).
• Build arguments or environment variables set on your machine but never passed in CI (Lab 3 below walks through this exact case).
• Platform or base-image drift: your laptop and the runner can resolve a :latest tag or default architecture differently.

Case 3: Push Denied

text
denied: requested access to the resource is denied

Causes:

• Not logged in to the registry: add docker/login-action@v3 before the push step.
• The token lacks packages: write (when pushing to GHCR).
• The image tag points at the wrong registry URL or a repository you don't have access to.

Case 4: Slow Builds (No Layer Caching)

Without caching, Docker rebuilds every layer from scratch on every CI run. Add GitHub Actions cache:

yaml
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max

This caches Docker layers in GitHub Actions cache. Subsequent builds only rebuild changed layers, cutting build times from 10+ minutes to under 1 minute.

Case 5: Multi-Platform Build Errors

If you need both linux/amd64 and linux/arm64 images, set up QEMU first:

yaml
- uses: docker/setup-qemu-action@v3
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    platforms: linux/amd64,linux/arm64
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}

πŸ› Scenario 6 β€” Helm / AKS Deployment Failures

Symptom: "UPGRADE FAILED", "timed out waiting for the condition", "ImagePullBackOff"

Case 1: helm upgrade --install Timeout

text
Error: UPGRADE FAILED: timed out waiting for the condition

This almost always means the new pods are crash-looping and never become Ready. Debug steps:

bash
# 1. Check pod status
kubectl get pods -n production -l app=myapp

# 2. Check why the pod is failing
kubectl describe pod <pod-name> -n production

# 3. Check application logs
kubectl logs <pod-name> -n production --previous

# 4. Check events for the namespace
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

Case 2: ImagePullBackOff

text
Warning  Failed   Back-off pulling image "myacr.azurecr.io/myapp:abc123"

Root causes:

• The tag doesn't exist in the registry: CI never pushed it, or the tag passed to Helm doesn't match what CI pushed.
• AKS isn't authorized to pull from ACR: run az aks check-acr, or attach the registry with az aks update --attach-acr.
• A typo in the registry or repository name.

Case 3: Wrong Helm Values

bash
# ❌ Common mistakes in CI (also note the missing \ after the --set line)
helm upgrade --install myapp ./charts/myapp \
  --set image.tag=${{ github.sha }}          # Missing quotes around the SHA!
  --namespace production                      # Missing --create-namespace on first deploy

# ✅ Correct
helm upgrade --install myapp ./charts/myapp \
  --set image.tag="${{ github.sha }}" \
  --namespace production \
  --create-namespace \
  --wait \
  --timeout 5m

Case 4: --atomic Causes Silent Rollback

When --atomic is set, Helm automatically rolls back on failure — but the error output is minimal. To see why it failed, add --debug:

bash
helm upgrade --install myapp ./charts/myapp \
  --atomic \
  --debug \
  --timeout 5m \
  --namespace production 2>&1 | tee helm-output.log

Case 5: Resource Quota Exceeded

text
Error creating: pods "myapp-xyz" is forbidden: exceeded quota: default-quota,
requested: cpu=500m, used: cpu=900m, limited: cpu=1000m

Fix: Either reduce the pod's resource requests in values.yaml, or ask the cluster admin to increase the ResourceQuota for the namespace.
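If you take the values route, the knobs live under resources: in values.yaml. A sketch with a typical chart layout; the numbers are illustrative, not recommendations:

```yaml
resources:
  requests:
    cpu: 100m        # was 500m - stays under the namespace's CPU quota
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi
```

Remember the quota counts requests across all pods in the namespace, so trimming one deployment may be enough.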

Case 6: Ingress Not Working

The deploy succeeds, pods are Running, but the app isn't accessible externally. Checklist:

• An ingress controller is actually installed in the cluster (AKS doesn't ship one by default).
• The Ingress resource's host and path match the URL you're testing.
• DNS for the host resolves to the ingress controller's external IP.
• The backing Service's selector and targetPort match the pod's labels and container port.

πŸ› Scenario 7 β€” Runner Issues

Symptom: "Waiting for a runner to pick up this job…" β€” the job sits queued for minutes or hours, or you see "No runner matching the specified labels was found."

Case 1: Wrong Runner Label

yaml
# ❌ Label doesn't match any runner
runs-on: self-hosted-linux

# ✅ Labels are comma-separated (an array), not hyphenated
runs-on: [self-hosted, linux]

Case 2: Self-Hosted Runner Offline

Check the runner status at Settings → Actions → Runners. If the runner shows "Offline":

• Confirm the runner process is running on the host (./run.sh interactively, or the service installed via svc.sh).
• Verify the machine has outbound HTTPS connectivity to github.com.
• If the registration token expired or the runner was removed, re-register it with config.sh.

Case 3: GitHub-Hosted Runner Capacity

During peak times, GitHub-hosted runners may take longer to provision. If jobs are queued for more than 5 minutes:

Case 4: Job Exceeds Maximum Runtime

yaml
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 15          # Default is 360 (6 hours). Set a sane limit.
    steps:
      - uses: actions/checkout@v4
      - run: npm test
💡
Tip

Always set timeout-minutes on your jobs. Without it, a hanging process (like a test waiting for user input) will burn 6 hours of runner time before GitHub kills it. For most CI jobs, 15–30 minutes is a generous limit.
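timeout-minutes also works at the step level, which is handy for fencing in one flaky command without capping the whole job (the step name and script are illustrative):

```yaml
- name: Integration tests
  timeout-minutes: 10        # only this step is capped; the job keeps its own limit
  run: npm run test:integration
```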

📊 Debugging Decision Tree

When a workflow fails, walk this decision tree from top to bottom to find your issue category fast:

text
Workflow failed?
│
├─ No run appeared in Actions tab
│   ├─ File not in .github/workflows/?         → Move it
│   ├─ File extension not .yml/.yaml?          → Rename it
│   ├─ YAML syntax error?                      → Validate with yq or VS Code
│   ├─ Branch/path filter doesn't match?       → Update trigger filters
│   ├─ Actions disabled on repo/fork?          → Enable in Settings
│   └─ Wrong event type?                       → Check on: trigger
│
├─ Run appeared, but job was skipped
│   ├─ if: condition evaluated to false?       → Check expression logic
│   └─ needs: dependency job failed?           → Fix the upstream job first
│
├─ Job started, but a step failed
│   ├─ Permission error (403)?                 → Add permissions: block
│   ├─ Secret is empty?                        → Check name, scope, fork policy
│   ├─ Docker build/push error?                → Check login, Dockerfile, context
│   ├─ Helm/K8s deployment error?              → Check kubeconfig, values, image
│   ├─ Test failure?                           → Check test config, service containers
│   └─ Timeout?                                → Set timeout-minutes, check for hangs
│
└─ Job succeeded, but result is wrong
    ├─ Output not passed between jobs?         → Check outputs:, needs: syntax
    ├─ Artifact missing?                       → Check upload/download action versions
    ├─ Wrong environment?                      → Verify environment: key
    └─ Cached stale data?                      → Clear cache or change key

📋 Quick Reference: Error Messages

Bookmark this table. It maps the exact error messages you'll see to their most likely cause and quickest fix.

| Error Message | Likely Cause | Quick Fix |
| --- | --- | --- |
| Invalid workflow file | YAML syntax error (tabs, missing colons) | Validate with yq or the VS Code extension |
| Resource not accessible by integration | GITHUB_TOKEN missing write permissions | Add a permissions: block to the workflow |
| HttpError: 403 | Token scope insufficient for the API call | Check the required permission for the specific API |
| denied: requested access to the resource is denied | Not logged in to the container registry, or wrong registry URL | Add docker/login-action@v3 before the push step |
| No runner matching the specified labels | Runner label mismatch or self-hosted runner offline | Verify the runs-on: label matches available runners |
| Process completed with exit code 1 | Shell command returned a non-zero exit code | Check the command output above the error line |
| UPGRADE FAILED: timed out waiting for the condition | Helm deploy pods are crash-looping or not becoming Ready | kubectl describe pod and kubectl logs |
| ImagePullBackOff | AKS can't pull the image from the registry | az aks check-acr / attach ACR / verify the tag |
| Error: Process completed with exit code 128 | Git authentication failure (clone, push, fetch) | Check that GITHUB_TOKEN or the PAT has repo access |
| Could not find artifact | Artifact upload/download name mismatch or expired (90 days) | Verify names match between upload and download steps |
| The template is not valid | Expression syntax error in ${{ }} | Check for unclosed brackets, invalid function names |
| Unrecognized named-value: 'env' | Using the env. context where it isn't available | Environment variables aren't available in if: at job level — use vars. instead |
| JsonPayloadError: request entity too large | Artifact or annotation payload exceeds the size limit | Reduce the payload size; annotations have a 64KB limit |
| Error: Timeout has been exceeded | Step or job exceeded timeout-minutes | Increase the timeout or optimize the slow step |
| The action requires a node20 runtime | Using an outdated action version with deprecated Node.js | Update the action to the latest version (@v4) |
| Input required and not supplied: token | Action expects an input that isn't provided | Check the action's README for required with: inputs |

🧪 Hands-on Labs

Lab 1: Fix the Broken Workflow

Copy this broken workflow into your repo as .github/workflows/lab1.yml and fix it until the "Read Report" step actually prints the report value:

yaml
# Lab 1: find and fix the bug
name: Lab 1 Broken
on:
  push:
    branches: [main]

jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 18

      - name: Print Greeting
        run: echo "Hello, ${{ github.actor }}"

      - name: Generate Report  
        run: |
          echo "report=success" >> $GITHUB_OUTPUT

      - name: Read Report
        run: echo "Report: ${{ steps.generate.outputs.report }}"
💡
Hints
  • One step is trying to read an output from another step — how does it know which step to look at?
  • The steps.<id>.outputs reference must use the id: of the producing step, and the step shown has no id.
  • Try adding id: generate to the "Generate Report" step.

Lab 2: Debug a "Secret Not Found" Scenario

Create a workflow that uses a secret called MY_API_KEY but intentionally misconfigure it. Then fix it step by step:

  1. Create a repository secret named MY_API_KEY with value sk-test-123.
  2. Create this workflow (it has 3 bugs related to secrets):
yaml
name: Lab 2 Secret Debug
on: workflow_dispatch

jobs:
  use-secret:
    runs-on: ubuntu-latest
    # Bug 1: Missing "environment: staging" — if the secret is environment-scoped
    steps:
      - name: Print key length
        # Bug 2: Using secrets.MY_API_KEYS (typo — extra S)
        run: |
          if [ -z "${{ secrets.MY_API_KEYS }}" ]; then
            echo "Secret is empty!"
          else
            echo "Secret is set"
          fi

      - name: Use key in URL
        # Bug 3: Secret in URL will be logged
        run: curl "https://api.example.com?key=${{ secrets.MY_API_KEY }}"

Fix all 3 bugs: correct the secret name, add environment: if needed, and pass the secret via env: instead of inline.
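One possible fixed version, as a sketch (the environment: line is only needed if you scoped the secret to an environment; the curl call is illustrative):

```yaml
name: Lab 2 Secret Debug
on: workflow_dispatch

jobs:
  use-secret:
    runs-on: ubuntu-latest
    # environment: staging          # uncomment if MY_API_KEY is environment-scoped
    steps:
      - name: Print key length
        env:
          MY_API_KEY: ${{ secrets.MY_API_KEY }}   # correct name, passed via env
        run: |
          if [ -z "$MY_API_KEY" ]; then
            echo "::error::MY_API_KEY is not set"
            exit 1
          fi
          echo "Key length: ${#MY_API_KEY}"

      - name: Use key safely
        env:
          MY_API_KEY: ${{ secrets.MY_API_KEY }}
        run: curl -s -H "Authorization: Bearer $MY_API_KEY" "https://api.example.com"
```

Note the key moved out of the URL and into a header, so an access log on the far side never sees it either.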

Lab 3: Fix a Docker Build That Works Locally but Fails in CI

Your Dockerfile uses a build argument for the API URL, but CI doesn't pass it:

dockerfile
# Dockerfile
FROM node:20-alpine
ARG API_URL
ENV API_URL=$API_URL
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build    # Fails because API_URL is undefined during build
yaml
# Workflow step β€” missing build-args
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}

Fix: Add the build-args parameter:

yaml
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}
    build-args: |
      API_URL=https://api.example.com

Lab 4: Diagnose a Helm Deployment Timeout

Your workflow shows UPGRADE FAILED: timed out waiting for the condition. Walk through this debugging sequence in order:

bash
# Step 1: Check what Helm sees
helm list -n production
helm history myapp -n production

# Step 2: Check pod status
kubectl get pods -n production -l app=myapp
# Look for: CrashLoopBackOff, ImagePullBackOff, Pending

# Step 3: If CrashLoopBackOff β€” check app logs
kubectl logs -l app=myapp -n production --previous --tail=50

# Step 4: If ImagePullBackOff β€” check image details
kubectl describe pod -l app=myapp -n production | grep -A5 "Events"

# Step 5: If Pending β€” check resource availability
kubectl describe pod -l app=myapp -n production | grep -A5 "Conditions"
kubectl top nodes

# Step 6: Roll back to the last working version
helm rollback myapp -n production

πŸ“ Bonus: Workflow Annotations

Use these special commands in your run: steps to create annotations that appear directly on the workflow run summary and in pull request checks:

yaml
- name: Check code quality
  run: |
    # Error — creates a red ❌ annotation
    echo "::error file=src/app.js,line=42,col=5::Undefined variable 'config'"

    # Warning — creates a yellow ⚠️ annotation
    echo "::warning file=src/app.js,line=10::Consider using const instead of let"

    # Notice — creates a blue ℹ️ annotation
    echo "::notice::Build completed in 45 seconds"

    # Group — collapses log output into a named section
    echo "::group::Test Results"
    cat test-results.txt
    echo "::endgroup::"
💡
Tip

The ::error:: annotation format with file= and line= parameters adds inline annotations to pull request diffs — just like a linter. Use this in custom validation scripts to give developers precise, file-level feedback.
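Because annotations are plain lines on stdout, any script can build one; GitHub only renders it when the script runs inside Actions. The file and line values here are illustrative:

```shell
# Assemble an ::error annotation string from script variables
file="src/app.js"
line=42
msg="::error file=${file},line=${line}::Undefined variable 'config'"
echo "$msg"
```

A custom linter can loop over its findings and emit one such line per issue to annotate the PR diff.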

💬 Interview Questions

Beginner (Conceptual)

Q1. How do you enable debug logging in GitHub Actions?

A: Three ways: (1) Create a repository secret ACTIONS_STEP_DEBUG set to true for step-level verbose logging, (2) Create ACTIONS_RUNNER_DEBUG set to true for runner-level diagnostics, (3) Re-run a failed workflow and check the "Enable debug logging" checkbox — this enables debug for that single re-run without creating secrets. Debug mode increases log volume significantly but shows internal action details, environment variable resolution, and every command being executed.

Q2. What happens when you reference a secret that doesn't exist or has a typo? Does the workflow fail?

A: No, the workflow does not fail. GitHub Actions returns an empty string for any undefined secret. This is a deliberate design choice to avoid leaking information about which secrets exist. The consequence is that a typo like secrets.ACR_PASWORD (missing S) silently returns empty, and your step may succeed but use blank credentials — causing confusing downstream failures. Best practice: add a verification step that checks -z "$SECRET_NAME" and explicitly fails if a required secret is empty.

Q3. What is the ::error:: workflow command format used for?

A: It creates annotations on the workflow run. The full format is ::error file={path},line={line},col={col}::{message}. When used with the file parameter, it adds inline annotations to pull request diffs — similar to how linters highlight issues. There are three severity levels: ::error:: (red), ::warning:: (yellow), and ::notice:: (blue). You can also use ::group::Name and ::endgroup:: to create collapsible log sections.

Q4. Why might a workflow that works on the main repo not appear at all on a fork?

A: GitHub Actions are disabled by default on forks. The fork owner must go to the Actions tab and click "I understand my workflows, go ahead and enable them." Additionally, if they're trying to trigger workflows via pull request to the upstream repo, fork PRs from first-time contributors require manual approval from a maintainer. Scheduled workflows (on: schedule) also only run on the default branch of the original repo, not on forks.

Q5. What is the default GITHUB_TOKEN permission since 2023 for new repositories?

A: Read-only for contents and metadata. All other permissions (packages, pull-requests, issues, etc.) default to none. This was changed from the previous default of broad write access as a security hardening measure. Any workflow that needs to write — push packages, comment on PRs, create releases — must explicitly declare permissions using the permissions: key at the workflow or job level.

Intermediate (Technical)

Q6. How do you debug an ImagePullBackOff error in a GitHub Actions AKS deployment?

A: Step-by-step: (1) kubectl describe pod <name> — look at the Events section for the exact pull error, (2) Verify the image exists: az acr repository show-tags -n myacr --repository myapp, (3) Check AKS-to-ACR authentication: az aks check-acr --name myaks --resource-group myrg --acr myacr.azurecr.io, (4) If authentication fails, attach ACR: az aks update -n myaks -g myrg --attach-acr myacr, (5) Check if the image tag in Helm values matches what CI actually pushed — a common bug is the SHA tag in the workflow not being passed correctly to --set image.tag.

Q7. Your workflow uses actions/cache but builds are still slow. What could be wrong?

A: Common causes: (1) The cache key changes every run (e.g., includes a timestamp), so it never hits, (2) The cache path doesn't match where the tool actually stores files — for npm it should be ~/.npm, not node_modules/, (3) The cache was evicted — GitHub limits cache to 10 GB per repo and evicts least-recently-used entries, (4) For Docker builds, you're using actions/cache instead of the native BuildKit cache (cache-from: type=gha), which is more efficient, (5) The restore-keys fallback pattern is too broad, restoring an incompatible old cache that gets discarded during install anyway.

Q8. What is the difference between pull_request and pull_request_target trigger events, and why is one dangerous?

A: pull_request runs workflow code from the PR head branch (the fork's code) but with a read-only token and no access to secrets. pull_request_target runs workflow code from the base branch (your repo's main) but with a write token and full secret access. The danger: if you use pull_request_target and then actions/checkout to check out the PR's code (ref: ${{ github.event.pull_request.head.sha }}), a malicious fork can modify the workflow to exfiltrate secrets. The rule: never check out untrusted code in a pull_request_target workflow.

Q9. A step fails with "Process completed with exit code 1" but there's no useful error. How do you get more details?

A: (1) Re-run with debug logging enabled for verbose output, (2) Add set -x at the top of the run: block to print every command before execution, (3) Add set -euo pipefail to make the shell fail on the exact line that errors instead of continuing, (4) Check if the command writes to stderr — GitHub sometimes truncates stderr output. Redirect with 2>&1 to merge streams, (5) If using a third-party action, check the action's source code on GitHub — many actions swallow error details in their catch blocks.
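Point (3) is easy to demonstrate in isolation. A minimal sketch, pure bash with no external tools, showing how pipefail changes the status a pipeline reports:

```shell
# Without pipefail, a pipeline's status is the LAST command's status,
# so an early failure is silently masked. Each case runs in its own
# bash -c so the two statuses can be compared side by side.
without=$(bash -c 'false | true; echo $?')                 # last command wins
with=$(bash -c 'set -o pipefail; false | true; echo $?')   # first failure wins

echo "without pipefail: ${without}, with pipefail: ${with}"
```

This prints `without pipefail: 0, with pipefail: 1` — exactly why a run: block without set -euo pipefail can sail straight past the real error.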

Q10. How can you test a workflow change without merging it to the main branch?

A: Several approaches: (1) Add a workflow_dispatch trigger and run manually from any branch via the Actions tab, (2) Use nektos/act to run locally, (3) Open a PR — pull_request events trigger the workflow from the PR branch, (4) Temporarily update the branches: filter to include your feature branch: branches: [main, my-feature], (5) Use a draft PR if you don't want reviews but need the workflow to run. Remember that on: push with branches: [main] will only trigger on pushes to main — it won't trigger on your feature branch unless you add it.
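A sketch of a trigger block combining approaches (1), (3), and (4); my-feature is a placeholder branch name:

```yaml
on:
  push:
    branches: [main, my-feature]  # temporary addition; remove before merging
  pull_request:                   # fires when a PR (including a draft) is opened
  workflow_dispatch:              # manual "Run workflow" button on any branch
```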

Scenario-Based (Advanced)

Q11. A colleague reports that their workflow runs on their fork but not on the main repo. What are the possible causes?

A: This is a reversed version of the typical problem. Possible causes: (1) Actions disabled on the main repo — check Settings → Actions → General, (2) Branch protection rules on the main repo require approval to run workflows, (3) The workflow's on: trigger is configured differently in the main repo's branch — perhaps the main branch has an older version of the workflow file, (4) Organization policy restricts which workflows can run — check org-level Actions settings, (5) The colleague's fork has a workflow_dispatch trigger and they're running it manually, but the main repo only triggers on push and they're not pushing to it, (6) Concurrency control on the main repo is cancelling or queuing their runs behind other deployments.

Q12. Your production deploy workflow ran successfully in Actions but the app shows the old version. Walk through your debugging process.

A: Systematic approach: (1) Verify Helm release: helm list -n production — is the revision number incremented? Check helm history myapp -n production, (2) Check the image tag: kubectl get deployment myapp -n production -o jsonpath='{.spec.template.spec.containers[0].image}' — does it show the new SHA?, (3) Check rollout status: kubectl rollout status deployment/myapp -n production — did the new pods actually replace the old ones?, (4) Check pod age: kubectl get pods -n production — are the pods recently created or still the old ones?, (5) DNS/CDN caching: the deploy may be correct but a CDN or browser is serving cached content. Check with curl -H 'Cache-Control: no-cache', (6) Wrong namespace: the deploy went to staging instead of production — a common --namespace mix-up, (7) Ingress routing: the old version is still served because the Ingress hasn't updated. Check kubectl describe ingress -n production.

Q13. Your matrix build passes for Node 18 and 20 but fails for Node 16 with a cryptic error. The error message is unhelpful. How do you isolate the issue?

A: (1) Re-run only the failed matrix job with debug logging to get verbose output, (2) Check Node 16 EOL status — Node 16 reached end-of-life in September 2023, and many packages have since dropped support for it. Check if a dependency updated and dropped Node 16 compatibility, (3) Lock the dependency versions: run npm ci (not npm install) with exactly the same package-lock.json locally on Node 16 to reproduce, (4) Check for missing APIs: Node 16 lacks globals and methods added later, such as structuredClone (Node 17), native fetch (Node 18), and Array.prototype.findLast() (Node 18), (5) Add a diagnostic step before the failing command: node --version && npm --version && npm ls to verify the exact environment, (6) Consider dropping Node 16 from the matrix since it's past EOL — this is often the correct fix.
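The diagnostic step from point (5) slots in just before the failing command. A sketch, assuming actions/setup-node and a standard npm project:

```yaml
strategy:
  matrix:
    node: [16, 18, 20]
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: ${{ matrix.node }}
  - name: Environment diagnostics
    # npm ls exits non-zero on peer-dependency issues, so don't let it fail the job
    run: node --version && npm --version && (npm ls || true)
  - run: npm ci
  - run: npm test
```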

Q14. Your workflow intermittently fails — it passes 70% of the time and fails 30%. How do you approach debugging a flaky workflow?

A: Flaky workflows usually fall into these categories: (1) Race conditions in tests: tests depend on timing, service startup, or order of execution. Fix with proper wait/retry logic and avoid sleep in favor of polling, (2) Rate-limited API calls: external APIs return 429 errors under load. Add retry with exponential backoff, (3) Resource contention: tests use shared resources like ports or databases. Use random ports and isolated test databases, (4) Network instability: package downloads, Docker pulls, or external service calls fail intermittently. Add retry logic or use caching to avoid network calls, (5) Runner resource exhaustion: parallel tests consume all available memory/CPU. Reduce parallelism or use a larger runner. To diagnose: download logs from multiple runs, diff the failing and passing logs, and look for the first line that diverges.
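The retry-with-exponential-backoff from point (2) is a few lines of shell. A sketch with a hypothetical retry helper and a simulated flaky command:

```shell
# retry <max_attempts> <command...> -- retries with exponential backoff.
retry() {
  local max=$1; shift
  local attempt=1 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))      # 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# Simulated flaky command: fails twice, succeeds on the third call.
calls=0
flaky() { calls=$((calls + 1)); [ "$calls" -ge 3 ]; }

retry 5 flaky && echo "succeeded on call ${calls}"
```

In a real workflow you would wrap the flaky command (an API call, a Docker pull) in the helper inside a run: block.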

Q15. A developer accidentally committed a secret to the repo and it was exposed in a workflow log. Walk through the incident response and prevention steps.

A: Response (immediate): (1) Revoke the secret immediately — rotate the API key, password, or token. Assume it's compromised, (2) Remove from git history: use git filter-branch or BFG Repo Cleaner to purge the commit, then force-push. Contact GitHub support to clear cached views, (3) Delete the workflow run log that exposed the value: Actions tab → click the run → gear icon → "Delete all logs", (4) Audit access: check if the secret was used during the exposure window. Prevention: (1) Enable GitHub Secret Scanning (free for public repos, available with GHAS for private), (2) Add .gitignore rules for .env, .secrets, etc., (3) Use pre-commit hooks like detect-secrets to block commits containing secrets, (4) Use repository/environment secrets instead of hardcoded values, (5) Add trufflehog or gitleaks to your CI pipeline to catch leaks before merge.
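Prevention step (5) can live in a small standalone workflow. A sketch assuming the gitleaks/gitleaks-action action (trufflehog publishes a similar one); check the action's docs for current version and required inputs:

```yaml
name: secret-scan
on: [pull_request]
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so older commits are scanned too
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```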

πŸ“ Summary
