Hands-on Lesson 13 of 14

Debugging Workflows

Master the art of diagnosing and fixing failed GitHub Actions workflows — from YAML syntax errors to production deployment failures.

🧒 Simple Explanation (ELI5)

Imagine your car breaks down on the highway. You don't replace the entire engine — that would be insane. Instead, you follow a diagnostic process: check the warning lights, read the error codes, and trace them to the failing part.

A junior developer sees a red ❌ and panics. A senior developer sees a red ❌ and thinks: "Great, the system is telling me exactly where to look." This page teaches you that systematic diagnostic process so you never panic at a failed workflow again.

🛠️ Debugging Toolkit

Before diving into specific scenarios, arm yourself with these essential tools. Every GitHub Actions developer should know these exist.

1. Enable Debug Logging (Step-Level)

Go to Settings → Secrets and variables → Actions → New repository secret and create:

text
Name:  ACTIONS_STEP_DEBUG
Value: true

This enables verbose output for every step — you'll see every command being run, environment variable resolution, and internal action details. The logs grow 5–10× larger but contain the exact line that failed.

2. Enable Runner Diagnostic Logging

text
Name:  ACTIONS_RUNNER_DEBUG
Value: true

This shows runner-level diagnostics — job setup, Docker layer pulls, cache resolution, and internal runner operations. Useful when the issue isn't in your code but in the runner environment itself.

3. Re-run with Debug Logging (No Secrets Needed)

On any failed workflow run, click "Re-run all jobs" → check the "Enable debug logging" checkbox. This is the quickest way to enable debug logs without creating secrets. The logs apply only to that single re-run.

4. Download Full Logs

On any workflow run page, click the gear icon ⚙️ → "Download log archive". You get a ZIP file containing the raw log for every job and step — perfect for searching with grep or sharing with teammates.

5. act — Run Actions Locally

Install nektos/act to run workflows on your local machine before pushing:

bash
# Install (macOS/Linux)
brew install act

# Run the default push event
act

# Run a specific job
act -j build

# Run with secrets from a .secrets file
act --secret-file .secrets
⚠️
Warning

act doesn't perfectly replicate GitHub-hosted runners. It uses Docker images that approximate the runner environment. Some actions (especially those using OIDC or runner-specific features) won't work locally. Use it for YAML validation and basic logic testing, not as a 1:1 replacement.
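For the --secret-file flag shown above, act reads a dotenv-style file of KEY=value pairs. A minimal sketch (the names mirror secrets used later in this lesson; all values are placeholders):

```text
# .secrets (dotenv format read by act's --secret-file flag)
ACR_PASSWORD=dummy-password
KUBE_CONFIG=dummy-kubeconfig
GITHUB_TOKEN=ghp_placeholder
```

Keep this file out of version control (add it to .gitignore) since real values would be credentials.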

6. VS Code Extension

Install the GitHub Actions extension (github.vscode-github-actions) for live YAML validation, auto-complete for action inputs, and the ability to trigger runs directly from VS Code.

7. workflow_dispatch for Quick Iteration

During development, add a manual trigger so you can test without pushing dummy commits:

yaml
on:
  workflow_dispatch:       # Manual trigger for testing
    inputs:
      debug:
        description: 'Enable debug mode'
        required: false
        type: boolean
        default: false
  push:
    branches: [main]

8. Debug Context Step

Add this step to any workflow — it prints the most useful context values (and dumps the full environment via env) so you can see exactly what GitHub Actions knows about the current run:

yaml
- name: Debug context
  if: runner.debug == '1'
  run: |
    echo "Event: ${{ github.event_name }}"
    echo "Ref: ${{ github.ref }}"
    echo "SHA: ${{ github.sha }}"
    echo "Actor: ${{ github.actor }}"
    echo "Workspace: ${{ github.workspace }}"
    echo "Runner OS: ${{ runner.os }}"
    echo "Runner Arch: ${{ runner.arch }}"
    env                                  # dump the full environment as well
💡
Tip

The if: runner.debug == '1' condition means this step only runs when debug logging is enabled. You can leave it in your production workflows permanently — it costs nothing during normal runs.

πŸ› Scenario 1 β€” YAML Syntax Errors

Symptom: You see "Invalid workflow file" in the GitHub UI. The workflow shows an error badge, or worse β€” it doesn't appear in the Actions tab at all. No run is triggered.

Break-and-Fix Lab

The following workflow has five deliberate errors. Read through it carefully and try to spot all of them before scrolling to the fix:

yaml
# ❌ BROKEN — Can you spot ALL 5 errors?
name: CI
on:
  push:
	  branches: [main]           # Error 1: TAB character used for indentation
  pull_request
    types: [opened]              # Error 2: Missing colon after pull_request

jobs:
  build:
    runs-on: ubuntu              # Error 3: Invalid runner label (should be ubuntu-latest)
    steps:
      - uses: actions/checkout@v4
      - run: echo "status": ready    # Error 4: Colon in unquoted string value
      - name: Set output
        run: echo "value=test" >> $GITHUB_OUTPUT
      - name: Use output
        run: echo ${{ steps.set-output.outputs.value }}  # Error 5: Step has no id

Now here's the fixed version with every correction annotated:

yaml
# ✅ FIXED — All 5 errors corrected
name: CI
on:
  push:
    branches: [main]             # Fix 1: Use SPACES (2-space indent), never tabs
  pull_request:                  # Fix 2: Added the missing colon
    types: [opened]

jobs:
  build:
    runs-on: ubuntu-latest       # Fix 3: Use full label "ubuntu-latest"
    steps:
      - uses: actions/checkout@v4
      - run: echo "status ready"     # Fix 4: Removed the problematic colon, or quote the whole string
      - name: Set output
        id: set-output               # Fix 5: Added the id so it can be referenced
        run: echo "value=test" >> $GITHUB_OUTPUT
      - name: Use output
        run: echo ${{ steps.set-output.outputs.value }}

5 Common YAML Pitfalls

| Pitfall | ❌ Broken | ✅ Fixed |
| --- | --- | --- |
| Tabs vs spaces | ⇥branches: [main] (tab indent) | branches: [main] (2 spaces) |
| Missing colons | pull_request | pull_request: |
| Special chars in values | run: echo "a": b | run: echo "a:b" or run: 'echo "a": b' |
| Boolean gotcha | bare on parsed as the boolean true by strict YAML tools | Quote the key ("on":) when an external tool complains; GitHub itself accepts bare on |
| Multiline strings | run: line1\nline2 | Use a literal block scalar: run: followed by a pipe character, then indented lines |
💡
Pro Tip

Validate your YAML locally before pushing: yq eval '.on' .github/workflows/ci.yml. If it errors, you have a syntax problem. VS Code also underlines YAML errors in real time with the GitHub Actions extension installed.
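Alongside yq, a quick local check for the #1 pitfall is grepping workflow files for TAB indentation. A minimal sketch, demonstrated on a throwaway file so it runs anywhere; in a real repo, point the grep at .github/workflows/ instead:

```shell
# Flag TAB-indented lines, which break GitHub's YAML parsing.
dir=$(mktemp -d)
printf 'on:\n\tpush:\n' > "$dir/ci.yml"   # deliberately tab-indented demo file
tab=$(printf '\t')
if grep -rn "^${tab}" "$dir" >/dev/null; then
  result="Tabs found - replace with spaces"
else
  result="No tabs"
fi
echo "$result"
rm -rf "$dir"
```

Wiring this into a pre-commit hook catches the error before GitHub silently ignores the file.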

πŸ› Scenario 2 β€” Workflow Never Triggers

Symptom: You push code, but no workflow run appears in the Actions tab. No error, no badge β€” nothing happens.

This is one of the most frustrating problems because there's no error message to debug. Work through this checklist top-to-bottom:

| Check | Root Cause | Fix |
| --- | --- | --- |
| File path | Workflow not in .github/workflows/ | Move the file to the exact path .github/workflows/ci.yml |
| File extension | Using .txt or no extension | Rename to .yml or .yaml — only these are recognized |
| Branch filter | branches: [main] but pushing to develop | Add the branch or remove the filter: branches: [main, develop] |
| Path filter | paths: ['src/**'] but you only changed README.md | Adjust the path filter or use paths-ignore instead |
| Workflow disabled | Actions disabled in repo settings, or workflow individually paused | Go to the Actions tab → click the disabled workflow → "Enable workflow" |
| Fork PR restrictions | First-time contributor from a fork; Actions requires approval | Go to the PR → click "Approve and run" for first-time contributors |
| YAML parse error | Silent failure — GitHub can't parse the YAML so it ignores the file | Validate with yq or the VS Code GitHub Actions extension |
| Repo is a fork | Actions are disabled by default on forks | Go to the fork's Actions tab → click "I understand, enable Actions" |
| Event type mismatch | on: pull_request but you pushed directly (no PR) | Add a push trigger or open a PR |
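As a concrete example of the branch- and path-filter rows above, a trigger that fires on pushes to either branch while skipping docs-only changes could look like this sketch (branch names are illustrative):

```yaml
on:
  push:
    branches: [main, develop]      # list every branch that should trigger CI
    paths-ignore:
      - '**.md'                    # docs-only changes won't start a run
      - 'docs/**'
```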
🔑
Important

The most common reason workflows silently don't trigger: the workflow file itself has a YAML syntax error. GitHub won't show an error in the Actions tab — it simply won't recognize the file. Always validate locally first.

πŸ› Scenario 3 β€” Permission Denied / 403 Errors

Symptom: You see one of these error messages:

text
Error: Resource not accessible by integration
HttpError: 403 Forbidden
Error: denied: requested access to the resource is denied
Error: The token provided does not have the required permissions

Every one of these means the GITHUB_TOKEN (or your custom token) doesn't have permission to do what your workflow is trying to do.

Common Causes and Fixes

Case 1: Missing permissions Block

Since 2023, new repositories default to read-only GITHUB_TOKEN permissions. If your workflow writes anything (PRs, packages, deployments), you must declare permissions explicitly:

yaml
# ❌ No permissions declared — defaults to read-only
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: gh pr comment --body "Deployed!" # Fails: 403

# ✅ Explicit permissions
permissions:
  pull-requests: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: gh pr comment --body "Deployed!" # Works

Case 2: Pushing Docker Images Without packages: write

yaml
permissions:
  contents: read
  packages: write    # Required for pushing to GHCR

Case 3: OIDC Token Request Without id-token: write

yaml
permissions:
  contents: read
  id-token: write    # Required for Azure/AWS/GCP OIDC login

Case 4: Fork PR with Restricted Token

Pull requests from forks get a read-only token by default — even if your workflow declares write permissions. This is a security feature, and you cannot override it from the workflow file.

⚠️
Security Note

Never use pull_request_target with actions/checkout@v4 pointing to the fork's code to "work around" the fork permission restriction. This is a critical security vulnerability — the fork's code runs with write access to your repository. Use pull_request (safe, read-only) and handle writes in a separate workflow triggered by a comment or label.
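One hedged sketch of that split: keep normal CI on pull_request, and gate the privileged work behind a maintainer-applied label. The label name and step body here are assumptions, not a prescribed pattern:

```yaml
on:
  pull_request_target:
    types: [labeled]

permissions:
  pull-requests: write

jobs:
  privileged:
    # Runs only after a maintainer applies the label, and never checks out fork code
    if: github.event.label.name == 'safe-to-test'
    runs-on: ubuntu-latest
    steps:
      - run: echo "privileged work here (comment on the PR, update labels, etc.)"
```

The key property: this workflow executes only your base branch's code, so the fork never controls what runs with the write token.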

Case 5: Organization Policy Restriction

Org admins can restrict which permissions Actions workflows can request. If your workflow needs packages: write but the org policy caps it at read, you'll get a 403. Fix: ask your org admin to update the Actions permissions policy under Organization Settings → Actions → General → Workflow permissions.

Full permissions Reference

yaml
# All available permissions (set individually as needed)
permissions:
  actions: read|write|none
  checks: read|write|none
  contents: read|write|none
  deployments: read|write|none
  id-token: write|none
  issues: read|write|none
  packages: read|write|none
  pages: read|write|none
  pull-requests: read|write|none
  repository-projects: read|write|none
  security-events: read|write|none
  statuses: read|write|none

πŸ› Scenario 4 β€” Secrets Issues

Symptom: Secret is empty, secret appears in logs, or secret comes from the wrong scope.

πŸ”‘
Critical Behavior

When you reference a secret that doesn't exist β€” ${{ secrets.TYPO }} β€” GitHub Actions returns an empty string. It does NOT throw an error. This is by far the #1 cause of "my secret isn't working" issues.

Case 1: Secret Name Typo

yaml
# ❌ Typo — returns empty string, NO error
- run: echo ${{ secrets.ACR_PASWORD }}

# ✅ Correct name
- run: echo ${{ secrets.ACR_PASSWORD }}

Case 2: Secret Not Available in Fork PRs

This is by design. Pull requests from forks cannot access repository secrets — this prevents malicious forks from stealing your credentials. The secret resolves to an empty string.

Case 3: Secret Exposed in URL

yaml
# ❌ DANGEROUS — URL is logged, secret is visible!
- run: git clone https://user:${{ secrets.TOKEN }}@github.com/org/repo.git

# ✅ Safe — use environment variable, mask is preserved
- run: git clone https://user:${TOKEN}@github.com/org/repo.git
  env:
    TOKEN: ${{ secrets.TOKEN }}

When a secret is interpolated directly into a run: command, the value is injected before the shell sees it. If that value appears in a URL and the URL gets logged (e.g., by git), the masking system may not catch it because the value appears as part of a larger string.

Case 4: Secret Exposed via Base64 / Transformation

yaml
# ❌ Encoded value is logged — masking only hides the original
- run: echo ${{ secrets.KEY }} | base64

# ✅ If you must encode, pass via env and suppress the output
- run: printf '%s' "$KEY" | base64 > encoded.txt
  env:
    KEY: ${{ secrets.KEY }}

Case 5: Environment Secret Not Loaded

yaml
# ❌ Environment secrets require the "environment" key
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - run: echo ${{ secrets.PROD_KEY }}    # Empty!

# ✅ Declare the environment
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production                   # Now PROD_KEY is available
    steps:
      - run: echo ${{ secrets.PROD_KEY }}

Case 6: Organization Secret Not Shared

Organization-level secrets must be explicitly shared with specific repositories. Go to Organization Settings → Secrets and variables → Actions → click the secret → check that your repo is in the "Repository access" list.

Secret Debugging Workflow

Use this reusable step to safely verify whether secrets are set without leaking them:

yaml
- name: Verify secrets are set
  run: |
    errors=0
    if [ -z "$ACR_PASSWORD" ]; then
      echo "::error::ACR_PASSWORD secret is not set!"
      errors=$((errors + 1))
    else
      echo "✅ ACR_PASSWORD is set (length: ${#ACR_PASSWORD})"
    fi
    if [ -z "$KUBE_CONFIG" ]; then
      echo "::error::KUBE_CONFIG secret is not set!"
      errors=$((errors + 1))
    else
      echo "✅ KUBE_CONFIG is set (length: ${#KUBE_CONFIG})"
    fi
    if [ $errors -gt 0 ]; then
      echo "::error::$errors secret(s) missing. Check repository settings."
      exit 1
    fi
  env:
    ACR_PASSWORD: ${{ secrets.ACR_PASSWORD }}
    KUBE_CONFIG: ${{ secrets.KUBE_CONFIG }}
💡
Why pass secrets through env:?

Passing secrets via env: (instead of inline ${{ secrets.X }}) ensures the masking engine always recognizes the value. It also avoids shell injection — a malicious secret value containing ; rm -rf / would be treated as a literal string in an environment variable, not as a shell command.
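The injection point can be illustrated entirely locally. In this self-contained sketch, PAYLOAD stands in for a hostile secret value; exported as an environment variable and expanded inside quotes, it stays one literal string instead of being spliced into the script as shell syntax:

```shell
# PAYLOAD plays the role of a malicious secret value
PAYLOAD='hello; echo INJECTED'
export SECRET="$PAYLOAD"
out=$(echo "$SECRET")         # quoted expansion: the ; is just a character
echo "$out"
```

Inline interpolation is the opposite case: the value is pasted into the script text before the shell parses it, so the ; would start a second command.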

πŸ› Scenario 5 β€” Docker Build / Push Failures

Symptom: "ERROR: failed to solve", "denied: requested access to the resource is denied", or the build takes 30+ minutes.

Case 1: Dockerfile Not Found

text
ERROR: failed to solve: failed to read dockerfile: open Dockerfile: no such file or directory

Cause: The context or file parameter in your Docker build action points to the wrong path.

yaml
# ❌ Dockerfile is in ./app/ but context is root
- uses: docker/build-push-action@v5
  with:
    context: .
    file: ./Dockerfile       # File not in root!

# ✅ Correct path
- uses: docker/build-push-action@v5
  with:
    context: ./app
    file: ./app/Dockerfile

Case 2: Build Fails in CI but Works Locally

Three common causes:

• Files that exist locally but not in the build context (untracked in git, or excluded by .dockerignore).
• Build arguments or environment variables set on your machine but never passed in CI (Lab 3 below walks through this exact case).
• Platform or base-image drift: your laptop and the runner can resolve a :latest tag or default architecture differently.

Case 3: Push Denied

text
denied: requested access to the resource is denied

Causes:

• Not logged in to the registry: add docker/login-action@v3 before the push step.
• The token lacks packages: write (when pushing to GHCR).
• The image tag points at the wrong registry URL or a repository you don't have access to.

Case 4: Slow Builds (No Layer Caching)

Without caching, Docker rebuilds every layer from scratch on every CI run. Add GitHub Actions cache:

yaml
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max

This caches Docker layers in GitHub Actions cache. Subsequent builds only rebuild changed layers, cutting build times from 10+ minutes to under 1 minute.

Case 5: Multi-Platform Build Errors

If you need both linux/amd64 and linux/arm64 images, set up QEMU first:

yaml
- uses: docker/setup-qemu-action@v3
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    platforms: linux/amd64,linux/arm64
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}

πŸ› Scenario 6 β€” Helm / AKS Deployment Failures

Symptom: "UPGRADE FAILED", "timed out waiting for the condition", "ImagePullBackOff"

Case 1: helm upgrade --install Timeout

text
Error: UPGRADE FAILED: timed out waiting for the condition

This almost always means the new pods are crash-looping and never become Ready. Debug steps:

bash
# 1. Check pod status
kubectl get pods -n production -l app=myapp

# 2. Check why the pod is failing
kubectl describe pod <pod-name> -n production

# 3. Check application logs
kubectl logs <pod-name> -n production --previous

# 4. Check events for the namespace
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20

Case 2: ImagePullBackOff

text
Warning  Failed   Back-off pulling image "myacr.azurecr.io/myapp:abc123"

Root causes:

• The tag doesn't exist in the registry: CI never pushed it, or the tag passed to Helm doesn't match what CI pushed.
• AKS isn't authorized to pull from ACR: run az aks check-acr, or attach the registry with az aks update --attach-acr.
• A typo in the registry or repository name.

Case 3: Wrong Helm Values

bash
# ❌ Common mistakes in CI (also note the missing \ after the --set line)
helm upgrade --install myapp ./charts/myapp \
  --set image.tag=${{ github.sha }}          # Missing quotes around the SHA!
  --namespace production                      # Missing --create-namespace on first deploy

# ✅ Correct
helm upgrade --install myapp ./charts/myapp \
  --set image.tag="${{ github.sha }}" \
  --namespace production \
  --create-namespace \
  --wait \
  --timeout 5m

Case 4: --atomic Causes Silent Rollback

When --atomic is set, Helm automatically rolls back on failure — but the error output is minimal. To see why it failed, add --debug:

bash
helm upgrade --install myapp ./charts/myapp \
  --atomic \
  --debug \
  --timeout 5m \
  --namespace production 2>&1 | tee helm-output.log

Case 5: Resource Quota Exceeded

text
Error creating: pods "myapp-xyz" is forbidden: exceeded quota: default-quota,
requested: cpu=500m, used: cpu=900m, limited: cpu=1000m

Fix: Either reduce the pod's resource requests in values.yaml, or ask the cluster admin to increase the ResourceQuota for the namespace.
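If you take the values route, the knobs live under resources: in values.yaml. A sketch with a typical chart layout; the numbers are illustrative, not recommendations:

```yaml
resources:
  requests:
    cpu: 100m        # was 500m - stays under the namespace's CPU quota
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi
```

Remember the quota counts requests across all pods in the namespace, so trimming one deployment may be enough.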

Case 6: Ingress Not Working

The deploy succeeds, pods are Running, but the app isn't accessible externally. Checklist:

• An ingress controller is actually installed in the cluster (AKS doesn't ship one by default).
• The Ingress resource's host and path match the URL you're testing.
• DNS for the host resolves to the ingress controller's external IP.
• The backing Service's selector and targetPort match the pod's labels and container port.

πŸ› Scenario 7 β€” Runner Issues

Symptom: "Waiting for a runner to pick up this job…" β€” the job sits queued for minutes or hours, or you see "No runner matching the specified labels was found."

Case 1: Wrong Runner Label

yaml
# ❌ Label doesn't match any runner
runs-on: self-hosted-linux

# ✅ Labels are comma-separated (an array), not hyphenated
runs-on: [self-hosted, linux]

Case 2: Self-Hosted Runner Offline

Check the runner status at Settings → Actions → Runners. If the runner shows "Offline":

• Confirm the runner process is running on the host (./run.sh interactively, or the service installed via svc.sh).
• Verify the machine has outbound HTTPS connectivity to github.com.
• If the registration token expired or the runner was removed, re-register it with config.sh.

Case 3: GitHub-Hosted Runner Capacity

During peak times, GitHub-hosted runners may take longer to provision. If jobs are queued for more than 5 minutes:

Case 4: Job Exceeds Maximum Runtime

yaml
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 15          # Default is 360 (6 hours). Set a sane limit.
    steps:
      - uses: actions/checkout@v4
      - run: npm test
💡
Tip

Always set timeout-minutes on your jobs. Without it, a hanging process (like a test waiting for user input) will burn 6 hours of runner time before GitHub kills it. For most CI jobs, 15–30 minutes is a generous limit.
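timeout-minutes also works at the step level, which is handy for fencing in one flaky command without capping the whole job (the step name and script are illustrative):

```yaml
- name: Integration tests
  timeout-minutes: 10        # only this step is capped; the job keeps its own limit
  run: npm run test:integration
```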

📊 Debugging Decision Tree

When a workflow fails, walk this decision tree from top to bottom to find your issue category fast:

text
Workflow failed?
│
├─ No run appeared in Actions tab
│   ├─ File not in .github/workflows/?         → Move it
│   ├─ File extension not .yml/.yaml?          → Rename it
│   ├─ YAML syntax error?                      → Validate with yq or VS Code
│   ├─ Branch/path filter doesn't match?       → Update trigger filters
│   ├─ Actions disabled on repo/fork?          → Enable in Settings
│   └─ Wrong event type?                       → Check on: trigger
│
├─ Run appeared, but job was skipped
│   ├─ if: condition evaluated to false?       → Check expression logic
│   └─ needs: dependency job failed?           → Fix the upstream job first
│
├─ Job started, but a step failed
│   ├─ Permission error (403)?                 → Add permissions: block
│   ├─ Secret is empty?                        → Check name, scope, fork policy
│   ├─ Docker build/push error?                → Check login, Dockerfile, context
│   ├─ Helm/K8s deployment error?              → Check kubeconfig, values, image
│   ├─ Test failure?                           → Check test config, service containers
│   └─ Timeout?                                → Set timeout-minutes, check for hangs
│
└─ Job succeeded, but result is wrong
    ├─ Output not passed between jobs?         → Check outputs:, needs: syntax
    ├─ Artifact missing?                       → Check upload/download action versions
    ├─ Wrong environment?                      → Verify environment: key
    └─ Cached stale data?                      → Clear cache or change key

📋 Quick Reference: Error Messages

Bookmark this table. It maps the exact error messages you'll see to their most likely cause and quickest fix.

| Error Message | Likely Cause | Quick Fix |
| --- | --- | --- |
| Invalid workflow file | YAML syntax error (tabs, missing colons) | Validate with yq or the VS Code extension |
| Resource not accessible by integration | GITHUB_TOKEN missing write permissions | Add a permissions: block to the workflow |
| HttpError: 403 | Token scope insufficient for the API call | Check the required permission for the specific API |
| denied: requested access to the resource is denied | Not logged in to the container registry, or wrong registry URL | Add docker/login-action@v3 before the push step |
| No runner matching the specified labels | Runner label mismatch or self-hosted runner offline | Verify the runs-on: label matches available runners |
| Process completed with exit code 1 | Shell command returned a non-zero exit code | Check the command output above the error line |
| UPGRADE FAILED: timed out waiting for the condition | Helm deploy pods are crash-looping or not becoming Ready | kubectl describe pod and kubectl logs |
| ImagePullBackOff | AKS can't pull the image from the registry | az aks check-acr / attach ACR / verify the tag |
| Error: Process completed with exit code 128 | Git authentication failure (clone, push, fetch) | Check that GITHUB_TOKEN or the PAT has repo access |
| Could not find artifact | Artifact upload/download name mismatch or expired (90 days) | Verify names match between upload and download steps |
| The template is not valid | Expression syntax error in ${{ }} | Check for unclosed brackets, invalid function names |
| Unrecognized named-value: 'env' | Using the env. context where it isn't available | Environment variables aren't available in if: at job level — use vars. instead |
| JsonPayloadError: request entity too large | Artifact or annotation payload exceeds the size limit | Reduce the payload size; annotations have a 64KB limit |
| Error: Timeout has been exceeded | Step or job exceeded timeout-minutes | Increase the timeout or optimize the slow step |
| The action requires a node20 runtime | Using an outdated action version with deprecated Node.js | Update the action to the latest version (@v4) |
| Input required and not supplied: token | Action expects an input that isn't provided | Check the action's README for required with: inputs |

🧪 Hands-on Labs

Lab 1: Fix the Broken Workflow

Copy this broken workflow into your repo as .github/workflows/lab1.yml and fix it until the "Read Report" step actually prints the report value:

yaml
# Lab 1: find and fix the bug
name: Lab 1 Broken
on:
  push:
    branches: [main]

jobs:
  greet:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node
        uses: actions/setup-node@v4
        with:
          node-version: 18

      - name: Print Greeting
        run: echo "Hello, ${{ github.actor }}"

      - name: Generate Report  
        run: |
          echo "report=success" >> $GITHUB_OUTPUT

      - name: Read Report
        run: echo "Report: ${{ steps.generate.outputs.report }}"
💡
Hints
  • One step is trying to read an output from another step — how does it know which step to look at?
  • The steps.<id>.outputs reference must use the id: of the producing step, and the step shown has no id.
  • Try adding id: generate to the "Generate Report" step.

Lab 2: Debug a "Secret Not Found" Scenario

Create a workflow that uses a secret called MY_API_KEY but intentionally misconfigure it. Then fix it step by step:

  1. Create a repository secret named MY_API_KEY with value sk-test-123.
  2. Create this workflow (it has 3 bugs related to secrets):
yaml
name: Lab 2 Secret Debug
on: workflow_dispatch

jobs:
  use-secret:
    runs-on: ubuntu-latest
    # Bug 1: Missing "environment: staging" — if the secret is environment-scoped
    steps:
      - name: Print key length
        # Bug 2: Using secrets.MY_API_KEYS (typo — extra S)
        run: |
          if [ -z "${{ secrets.MY_API_KEYS }}" ]; then
            echo "Secret is empty!"
          else
            echo "Secret is set"
          fi

      - name: Use key in URL
        # Bug 3: Secret in URL will be logged
        run: curl "https://api.example.com?key=${{ secrets.MY_API_KEY }}"

Fix all 3 bugs: correct the secret name, add environment: if needed, and pass the secret via env: instead of inline.
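One possible fixed version, as a sketch (the environment: line is only needed if you scoped the secret to an environment; the curl call is illustrative):

```yaml
name: Lab 2 Secret Debug
on: workflow_dispatch

jobs:
  use-secret:
    runs-on: ubuntu-latest
    # environment: staging          # uncomment if MY_API_KEY is environment-scoped
    steps:
      - name: Print key length
        env:
          MY_API_KEY: ${{ secrets.MY_API_KEY }}   # correct name, passed via env
        run: |
          if [ -z "$MY_API_KEY" ]; then
            echo "::error::MY_API_KEY is not set"
            exit 1
          fi
          echo "Key length: ${#MY_API_KEY}"

      - name: Use key safely
        env:
          MY_API_KEY: ${{ secrets.MY_API_KEY }}
        run: curl -s -H "Authorization: Bearer $MY_API_KEY" "https://api.example.com"
```

Note the key moved out of the URL and into a header, so an access log on the far side never sees it either.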

Lab 3: Fix a Docker Build That Works Locally but Fails in CI

Your Dockerfile uses a build argument for the API URL, but CI doesn't pass it:

dockerfile
# Dockerfile
FROM node:20-alpine
ARG API_URL
ENV API_URL=$API_URL
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build    # Fails because API_URL is undefined during build
yaml
# Workflow step β€” missing build-args
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}

Fix: Add the build-args parameter:

yaml
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: myacr.azurecr.io/myapp:${{ github.sha }}
    build-args: |
      API_URL=https://api.example.com

Lab 4: Diagnose a Helm Deployment Timeout

Your workflow shows UPGRADE FAILED: timed out waiting for the condition. Walk through this debugging sequence in order:

bash
# Step 1: Check what Helm sees
helm list -n production
helm history myapp -n production

# Step 2: Check pod status
kubectl get pods -n production -l app=myapp
# Look for: CrashLoopBackOff, ImagePullBackOff, Pending

# Step 3: If CrashLoopBackOff β€” check app logs
kubectl logs -l app=myapp -n production --previous --tail=50

# Step 4: If ImagePullBackOff β€” check image details
kubectl describe pod -l app=myapp -n production | grep -A5 "Events"

# Step 5: If Pending β€” check resource availability
kubectl describe pod -l app=myapp -n production | grep -A5 "Conditions"
kubectl top nodes

# Step 6: Roll back to the last working version
helm rollback myapp -n production

πŸ“ Bonus: Workflow Annotations

Use these special commands in your run: steps to create annotations that appear directly on the workflow run summary and in pull request checks:

yaml
- name: Check code quality
  run: |
    # Error — creates a red ❌ annotation
    echo "::error file=src/app.js,line=42,col=5::Undefined variable 'config'"

    # Warning — creates a yellow ⚠️ annotation
    echo "::warning file=src/app.js,line=10::Consider using const instead of let"

    # Notice — creates a blue ℹ️ annotation
    echo "::notice::Build completed in 45 seconds"

    # Group — collapses log output into a named section
    echo "::group::Test Results"
    cat test-results.txt
    echo "::endgroup::"
💡
Tip

The ::error:: annotation format with file= and line= parameters adds inline annotations to pull request diffs — just like a linter. Use this in custom validation scripts to give developers precise, file-level feedback.
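Because annotations are plain lines on stdout, any script can build one; GitHub only renders it when the script runs inside Actions. The file and line values here are illustrative:

```shell
# Assemble an ::error annotation string from script variables
file="src/app.js"
line=42
msg="::error file=${file},line=${line}::Undefined variable 'config'"
echo "$msg"
```

A custom linter can loop over its findings and emit one such line per issue to annotate the PR diff.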

💬 Interview Questions

Beginner (Conceptual)

Q1. How do you enable debug logging in GitHub Actions?

A: Three ways: (1) Create a repository secret ACTIONS_STEP_DEBUG set to true for step-level verbose logging, (2) Create ACTIONS_RUNNER_DEBUG set to true for runner-level diagnostics, (3) Re-run a failed workflow and check the "Enable debug logging" checkbox — this enables debug for that single re-run without creating secrets. Debug mode increases log volume significantly but shows internal action details, environment variable resolution, and every command being executed.

Q2. What happens when you reference a secret that doesn't exist or has a typo? Does the workflow fail?

A: No, the workflow does not fail. GitHub Actions returns an empty string for any undefined secret. This is a deliberate design choice to avoid leaking information about which secrets exist. The consequence is that a typo like secrets.ACR_PASWORD (missing S) silently returns empty, and your step may succeed but use blank credentials — causing confusing downstream failures. Best practice: add a verification step that checks -z "$SECRET_NAME" and explicitly fails if a required secret is empty.

Q3. What is the ::error:: workflow command format used for?

A: It creates annotations on the workflow run. The full format is ::error file={path},line={line},col={col}::{message}. When used with the file parameter, it adds inline annotations to pull request diffs — similar to how linters highlight issues. There are three severity levels: ::error:: (red), ::warning:: (yellow), and ::notice:: (blue). You can also use ::group::Name and ::endgroup:: to create collapsible log sections.

Q4. Why might a workflow that works on the main repo not appear at all on a fork?

A: GitHub Actions are disabled by default on forks. The fork owner must go to the Actions tab and click "I understand my workflows, go ahead and enable them." Additionally, if they're trying to trigger workflows via pull request to the upstream repo, fork PRs from first-time contributors require manual approval from a maintainer. Scheduled workflows (on: schedule) also only run on the default branch of the original repo, not on forks.

Q5. What is the default GITHUB_TOKEN permission since 2023 for new repositories?

A: Read-only for contents and metadata. All other permissions (packages, pull-requests, issues, etc.) default to none. This was changed from the previous default of broad write access as a security hardening measure. Any workflow that needs to write — push packages, comment on PRs, create releases — must explicitly declare permissions using the permissions: key at the workflow or job level.

Intermediate (Technical)

Q6. How do you debug an ImagePullBackOff error in a GitHub Actions AKS deployment?

A: Step-by-step: (1) kubectl describe pod <name> — look at the Events section for the exact pull error, (2) Verify the image exists: az acr repository show-tags -n myacr --repository myapp, (3) Check AKS-to-ACR authentication: az aks check-acr --name myaks --resource-group myrg --acr myacr.azurecr.io, (4) If authentication fails, attach ACR: az aks update -n myaks -g myrg --attach-acr myacr, (5) Check if the image tag in Helm values matches what CI actually pushed — a common bug is the SHA tag in the workflow not being passed correctly to --set image.tag.

Q7. Your workflow uses actions/cache but builds are still slow. What could be wrong?

A: Common causes: (1) The cache key changes every run (e.g., includes a timestamp), so it never hits, (2) The cache path doesn't match where the tool actually stores files — for npm it should be ~/.npm, not node_modules/, (3) The cache was evicted — GitHub limits cache to 10 GB per repo and evicts least-recently-used entries, (4) For Docker builds, you're using actions/cache instead of the native BuildKit cache (cache-from: type=gha), which is more efficient, (5) The restore-keys fallback pattern is too broad, restoring an incompatible old cache that gets discarded during install anyway.

Q8. What is the difference between pull_request and pull_request_target trigger events, and why is one dangerous?

A: pull_request runs workflow code from the PR head branch (the fork's code) but with a read-only token and no access to secrets. pull_request_target runs workflow code from the base branch (your repo's main) but with a write token and full secret access. The danger: if you use pull_request_target and then actions/checkout to check out the PR's code (ref: ${{ github.event.pull_request.head.sha }}), a malicious fork can modify the workflow to exfiltrate secrets. The rule: never check out untrusted code in a pull_request_target workflow.

Q9. A step fails with "Process completed with exit code 1" but there's no useful error. How do you get more details?

A: (1) Re-run with debug logging enabled for verbose output, (2) Add set -x at the top of the run: block to print every command before execution, (3) Add set -euo pipefail to make the shell fail on the exact line that errors instead of continuing, (4) Check if the command writes to stderr — GitHub sometimes truncates stderr output. Redirect with 2>&1 to merge streams, (5) If using a third-party action, check the action's source code on GitHub — many actions swallow error details in their catch blocks.
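Point (3) is easy to demonstrate in isolation. A minimal sketch, pure bash with no external tools, showing how pipefail changes the status a pipeline reports:

```shell
# Without pipefail, a pipeline's status is the LAST command's status,
# so an early failure is silently masked. Each case runs in its own
# bash -c so the two statuses can be compared side by side.
without=$(bash -c 'false | true; echo $?')                 # last command wins
with=$(bash -c 'set -o pipefail; false | true; echo $?')   # first failure wins

echo "without pipefail: ${without}, with pipefail: ${with}"
```

This prints `without pipefail: 0, with pipefail: 1` — exactly why a run: block without set -euo pipefail can sail straight past the real error.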

Q10. How can you test a workflow change without merging it to the main branch?

A: Several approaches: (1) Add a workflow_dispatch trigger and run manually from any branch via the Actions tab, (2) Use nektos/act to run locally, (3) Open a PR — pull_request events trigger the workflow from the PR branch, (4) Temporarily update the branches: filter to include your feature branch: branches: [main, my-feature], (5) Use a draft PR if you don't want reviews but need the workflow to run. Remember that on: push with branches: [main] will only trigger on pushes to main — it won't trigger on your feature branch unless you add it.
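A sketch of a trigger block combining approaches (1), (3), and (4); my-feature is a placeholder branch name:

```yaml
on:
  push:
    branches: [main, my-feature]  # temporary addition; remove before merging
  pull_request:                   # fires when a PR (including a draft) is opened
  workflow_dispatch:              # manual "Run workflow" button on any branch
```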

Scenario-Based (Advanced)

Q11. A colleague reports that their workflow runs on their fork but not on the main repo. What are the possible causes?

A: This is a reversed version of the typical problem. Possible causes: (1) Actions disabled on the main repo — check Settings → Actions → General, (2) Branch protection rules on the main repo require approval to run workflows, (3) The workflow's on: trigger is configured differently in the main repo's branch — perhaps the main branch has an older version of the workflow file, (4) Organization policy restricts which workflows can run — check org-level Actions settings, (5) The colleague's fork has a workflow_dispatch trigger and they're running it manually, but the main repo only triggers on push and they're not pushing to it, (6) Concurrency control on the main repo is cancelling or queuing their runs behind other deployments.

Q12. Your production deploy workflow ran successfully in Actions but the app shows the old version. Walk through your debugging process.

A: Systematic approach: (1) Verify Helm release: helm list -n production — is the revision number incremented? Check helm history myapp -n production, (2) Check the image tag: kubectl get deployment myapp -n production -o jsonpath='{.spec.template.spec.containers[0].image}' — does it show the new SHA?, (3) Check rollout status: kubectl rollout status deployment/myapp -n production — did the new pods actually replace the old ones?, (4) Check pod age: kubectl get pods -n production — are the pods recently created or still the old ones?, (5) DNS/CDN caching: the deploy may be correct but a CDN or browser is serving cached content. Check with curl -H 'Cache-Control: no-cache', (6) Wrong namespace: the deploy went to staging instead of production — a common --namespace mix-up, (7) Ingress routing: the old version is still served because the Ingress hasn't updated. Check kubectl describe ingress -n production.

Q13. Your matrix build passes for Node 18 and 20 but fails for Node 16 with a cryptic error. The error message is unhelpful. How do you isolate the issue?

A: (1) Re-run only the failed matrix job with debug logging to get verbose output, (2) Check Node 16 EOL status — Node 16 reached end-of-life in September 2023, and many packages have since dropped support for it. Check if a dependency updated and dropped Node 16 compatibility, (3) Lock the dependency versions: run npm ci (not npm install) with exactly the same package-lock.json locally on Node 16 to reproduce, (4) Check for missing APIs: Node 16 lacks globals and methods added later, such as structuredClone (Node 17), native fetch (Node 18), and Array.prototype.findLast() (Node 18), (5) Add a diagnostic step before the failing command: node --version && npm --version && npm ls to verify the exact environment, (6) Consider dropping Node 16 from the matrix since it's past EOL — this is often the correct fix.
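The diagnostic step from point (5) slots in just before the failing command. A sketch, assuming actions/setup-node and a standard npm project:

```yaml
strategy:
  matrix:
    node: [16, 18, 20]
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: ${{ matrix.node }}
  - name: Environment diagnostics
    # npm ls exits non-zero on peer-dependency issues, so don't let it fail the job
    run: node --version && npm --version && (npm ls || true)
  - run: npm ci
  - run: npm test
```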

Q14. Your workflow intermittently fails — it passes 70% of the time and fails 30%. How do you approach debugging a flaky workflow?

A: Flaky workflows usually fall into these categories: (1) Race conditions in tests: tests depend on timing, service startup, or order of execution. Fix with proper wait/retry logic and avoid sleep in favor of polling, (2) Rate-limited API calls: external APIs return 429 errors under load. Add retry with exponential backoff, (3) Resource contention: tests use shared resources like ports or databases. Use random ports and isolated test databases, (4) Network instability: package downloads, Docker pulls, or external service calls fail intermittently. Add retry logic or use caching to avoid network calls, (5) Runner resource exhaustion: parallel tests consume all available memory/CPU. Reduce parallelism or use a larger runner. To diagnose: download logs from multiple runs, diff the failing and passing logs, and look for the first line that diverges.
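The retry-with-exponential-backoff from point (2) is a few lines of shell. A sketch with a hypothetical retry helper and a simulated flaky command:

```shell
# retry <max_attempts> <command...> -- retries with exponential backoff.
retry() {
  local max=$1; shift
  local attempt=1 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))      # 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# Simulated flaky command: fails twice, succeeds on the third call.
calls=0
flaky() { calls=$((calls + 1)); [ "$calls" -ge 3 ]; }

retry 5 flaky && echo "succeeded on call ${calls}"
```

In a real workflow you would wrap the flaky command (an API call, a Docker pull) in the helper inside a run: block.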

Q15. A developer accidentally committed a secret to the repo and it was exposed in a workflow log. Walk through the incident response and prevention steps.

A: Response (immediate): (1) Revoke the secret immediately — rotate the API key, password, or token. Assume it's compromised, (2) Remove from git history: use git filter-branch or BFG Repo Cleaner to purge the commit, then force-push. Contact GitHub support to clear cached views, (3) Delete the workflow run log that exposed the value: Actions tab → click the run → gear icon → "Delete all logs", (4) Audit access: check if the secret was used during the exposure window. Prevention: (1) Enable GitHub Secret Scanning (free for public repos, available with GHAS for private), (2) Add .gitignore rules for .env, .secrets, etc., (3) Use pre-commit hooks like detect-secrets to block commits containing secrets, (4) Use repository/environment secrets instead of hardcoded values, (5) Add trufflehog or gitleaks to your CI pipeline to catch leaks before merge.
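Prevention step (5) can live in a small standalone workflow. A sketch assuming the gitleaks/gitleaks-action action (trufflehog publishes a similar one); check the action's docs for current version and required inputs:

```yaml
name: secret-scan
on: [pull_request]
jobs:
  gitleaks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history, so older commits are scanned too
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```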

πŸ“ Summary
