Debugging Workflows
Master the art of diagnosing and fixing failed GitHub Actions workflows, from YAML syntax errors to production deployment failures.
Simple Explanation (ELI5)
Imagine your car breaks down on the highway. You don't replace the entire engine; that would be insane. Instead, you follow a diagnostic process:
- Check the dashboard warning light: that's the error message in your workflow run.
- Open the hood: that's expanding the workflow logs in the Actions tab.
- Test each system one by one: that's isolating the failing step and reading its output.
- Find the broken part: that's the root cause, a bad secret, a missing permission, a YAML typo.
- Fix it, close the hood, and drive: push the fix, watch the workflow go green.
A junior developer sees a red ❌ and panics. A senior developer sees a red ❌ and thinks: "Great, the system is telling me exactly where to look." This page teaches you that systematic diagnostic process so you never panic at a failed workflow again.
Debugging Toolkit
Before diving into specific scenarios, arm yourself with these essential tools. Every GitHub Actions developer should know these exist.
1. Enable Debug Logging (Step-Level)
Go to Settings → Secrets and variables → Actions → New repository secret and create:
Name: ACTIONS_STEP_DEBUG Value: true
This enables verbose output for every step: you'll see every command being run, environment variable resolution, and internal action details. The logs grow 5–10× larger but contain the exact line that failed.
2. Enable Runner Diagnostic Logging
Name: ACTIONS_RUNNER_DEBUG Value: true
This shows runner-level diagnostics: job setup, Docker layer pulls, cache resolution, and internal runner operations. Useful when the issue isn't in your code but in the runner environment itself.
3. Re-run with Debug Logging (No Secrets Needed)
On any failed workflow run, click "Re-run all jobs" and check the "Enable debug logging" checkbox. This is the quickest way to enable debug logs without creating secrets. The logs apply only to that single re-run.
4. Download Full Logs
On any workflow run page, click the gear icon ⚙️ → "Download log archive". You get a ZIP file containing the raw log for every job and step, perfect for searching with grep or sharing with teammates.
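Once extracted, the raw logs prefix every failing command with a `##[error]` marker, which makes the archive easy to grep. A minimal sketch, assuming you unzipped the archive into a `logs/` directory (the directory name and helper name are illustrative):

```shell
# List the first few lines GitHub marked as errors anywhere in the
# extracted log archive - usually lands you on the exact failing command.
find_first_errors() {
  grep -rn '##\[error\]' "${1:-logs}" | head -5
}
```

Running `find_first_errors logs` prints `file:line:` prefixes too, so you can jump straight to the surrounding context in the right job's log file.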
5. act: Run Actions Locally
Install nektos/act to run workflows on your local machine before pushing:
# Install (macOS/Linux)
brew install act

# Run the default push event
act

# Run a specific job
act -j build

# Run with secrets from a .secrets file
act --secret-file .secrets
act doesn't perfectly replicate GitHub-hosted runners. It uses Docker images that approximate the runner environment. Some actions (especially those using OIDC or runner-specific features) won't work locally. Use it for YAML validation and basic logic testing, not as a 1:1 replacement.
6. VS Code Extension
Install the GitHub Actions extension (github.vscode-github-actions) for live YAML validation, auto-complete for action inputs, and the ability to trigger runs directly from VS Code.
7. workflow_dispatch for Quick Iteration
During development, add a manual trigger so you can test without pushing dummy commits:
on:
workflow_dispatch: # Manual trigger for testing
inputs:
debug:
description: 'Enable debug mode'
required: false
type: boolean
default: false
push:
branches: [main]
8. Debug Context Step
Add this step to any workflow β it dumps all the context variables so you can see exactly what GitHub Actions knows about the current run:
- name: Debug context
  if: runner.debug == '1'
  run: |
    echo "Event: ${{ github.event_name }}"
    echo "Ref: ${{ github.ref }}"
    echo "SHA: ${{ github.sha }}"
    echo "Actor: ${{ github.actor }}"
    echo "Workspace: ${{ github.workspace }}"
    echo "Runner OS: ${{ runner.os }}"
    echo "Runner Arch: ${{ runner.arch }}"
    env  # dump the full environment as well
The if: runner.debug == '1' condition means this step only runs when debug logging is enabled. You can leave it in your production workflows permanently; it costs nothing during normal runs.
Scenario 1: YAML Syntax Errors
Symptom: You see "Invalid workflow file" in the GitHub UI. The workflow shows an error badge, or worse, it doesn't appear in the Actions tab at all. No run is triggered.
Break-and-Fix Lab
The following workflow has five deliberate errors. Read through it carefully and try to spot all of them before scrolling to the fix:
# ❌ BROKEN - Can you spot ALL 5 errors?
name: CI
on:
push:
branches: [main] # Error 1: TAB character used for indentation
pull_request
types: [opened] # Error 2: Missing colon after pull_request
jobs:
build:
runs-on: ubuntu # Error 3: Invalid runner label (should be ubuntu-latest)
steps:
- uses: actions/checkout@v4
- run: echo "status": ready # Error 4: Colon in unquoted string value
- name: Set output
run: echo "value=test" >> $GITHUB_OUTPUT
- name: Use output
run: echo ${{ steps.set-output.outputs.value }} # Error 5: Step has no id
Now here's the fixed version with every correction annotated:
# ✅ FIXED - All 5 errors corrected
name: CI
on:
push:
branches: [main] # Fix 1: Use SPACES (2-space indent), never tabs
pull_request: # Fix 2: Added the missing colon
types: [opened]
jobs:
build:
runs-on: ubuntu-latest # Fix 3: Use full label "ubuntu-latest"
steps:
- uses: actions/checkout@v4
- run: echo "status ready" # Fix 4: Removed the problematic colon, or quote the whole string
- name: Set output
id: set-output # Fix 5: Added the id so it can be referenced
run: echo "value=test" >> $GITHUB_OUTPUT
- name: Use output
run: echo ${{ steps.set-output.outputs.value }}
5 Common YAML Pitfalls
| Pitfall | ❌ Broken | ✅ Fixed |
|---|---|---|
| Tabs vs spaces | `⇥branches: [main]` (tab indent) | `branches: [main]` (2 spaces) |
| Missing colons | `pull_request` | `pull_request:` |
| Special chars in values | `run: echo "a": b` | `run: echo "a:b"` or `run: 'echo "a": b'` |
| Boolean gotcha | `on: true` (parsed as boolean) | `"on": true` or `on: [push]` |
| Multiline strings | `run: line1\nline2` | Use `run: \|` followed by indented lines |
Validate your YAML locally before pushing: yq eval '.on' .github/workflows/ci.yml. If it errors, you have a syntax problem. VS Code also underlines YAML errors in real time with the GitHub Actions extension installed.
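If yq isn't installed, even plain grep catches the single most common syntax killer. A minimal pre-push check, assuming the standard workflow directory (the function name is illustrative):

```shell
# Fail if any workflow file contains a tab character - YAML indentation
# must use spaces, and one stray tab makes the whole file invalid.
check_workflow_tabs() {
  dir="${1:-.github/workflows}"
  tab="$(printf '\t')"
  if grep -rn "$tab" "$dir"; then
    echo "Tab indentation found - replace with spaces" >&2
    return 1
  fi
}
```

This could run as a pre-commit hook so "Invalid workflow file" never makes it to a push.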
Scenario 2: Workflow Never Triggers
Symptom: You push code, but no workflow run appears in the Actions tab. No error, no badge, nothing happens.
This is one of the most frustrating problems because there's no error message to debug. Work through this checklist top-to-bottom:
| Check | Root Cause | Fix |
|---|---|---|
| File path | Workflow not in `.github/workflows/` | Move the file to the exact path `.github/workflows/ci.yml` |
| File extension | Using `.txt` or no extension | Rename to `.yml` or `.yaml`; only these are recognized |
| Branch filter | `branches: [main]` but pushing to `develop` | Add the branch or remove the filter: `branches: [main, develop]` |
| Path filter | `paths: ['src/**']` but you only changed `README.md` | Adjust the path filter or use `paths-ignore` instead |
| Workflow disabled | Actions disabled in repo settings, or workflow individually paused | Go to the Actions tab → click the disabled workflow → "Enable workflow" |
| Fork PR restrictions | First-time contributor from a fork; Actions requires approval | Go to the PR → click "Approve and run" for first-time contributors |
| YAML parse error | Silent failure: GitHub can't parse the YAML so it ignores it | Validate with yq or the VS Code GitHub Actions extension |
| Repo is a fork | Actions are disabled by default on forks | Go to the fork's Actions tab → click "I understand, enable Actions" |
| Event type mismatch | `on: pull_request` but you pushed directly (no PR) | Add a `push` trigger or open a PR |
The most common reason workflows silently don't trigger: the workflow file itself has a YAML syntax error. GitHub won't show an error in the Actions tab; it simply won't recognize the file. Always validate locally first.
Scenario 3: Permission Denied / 403 Errors
Symptom: You see one of these error messages:
Error: Resource not accessible by integration
HttpError: 403 Forbidden
Error: denied: requested access to the resource is denied
Error: The token provided does not have the required permissions
Every one of these means the GITHUB_TOKEN (or your custom token) doesn't have permission to do what your workflow is trying to do.
Common Causes and Fixes
Case 1: Missing permissions Block
Since 2023, new repositories default to read-only GITHUB_TOKEN permissions. If your workflow writes anything (PRs, packages, deployments), you must declare permissions explicitly:
# ❌ No permissions declared - defaults to read-only
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: gh pr comment --body "Deployed!" # Fails: 403
# ✅ Explicit permissions
permissions:
pull-requests: write
contents: read
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: gh pr comment --body "Deployed!" # Works
Case 2: Pushing Docker Images Without packages: write
permissions:
  contents: read
  packages: write  # Required for pushing to GHCR
Case 3: OIDC Token Request Without id-token: write
permissions:
  contents: read
  id-token: write  # Required for Azure/AWS/GCP OIDC login
Case 4: Fork PR with Restricted Token
Pull requests from forks get a read-only token by default, even if your workflow declares write permissions. This is a security feature. You cannot override it.
Never use pull_request_target with actions/checkout@v4 pointing to the fork's code to "work around" the fork permission restriction. This is a critical security vulnerability: the fork's code runs with write access to your repository. Use pull_request (safe, read-only) and handle writes in a separate workflow triggered by a comment or label.
Case 5: Organization Policy Restriction
Org admins can restrict which permissions Actions workflows can request. If your workflow needs packages: write but the org policy caps it at read, you'll get a 403. Fix: Ask your org admin to update the Actions permissions policy under Organization Settings → Actions → General → Workflow permissions.
Full permissions Reference
# All available permissions (set individually as needed)
permissions:
  actions: read|write|none
  checks: read|write|none
  contents: read|write|none
  deployments: read|write|none
  id-token: write|none
  issues: read|write|none
  packages: read|write|none
  pages: read|write|none
  pull-requests: read|write|none
  repository-projects: read|write|none
  security-events: read|write|none
  statuses: read|write|none
Scenario 4: Secrets Issues
Symptom: Secret is empty, secret appears in logs, or secret comes from the wrong scope.
When you reference a secret that doesn't exist (e.g. ${{ secrets.TYPO }}), GitHub Actions returns an empty string. It does NOT throw an error. This is by far the #1 cause of "my secret isn't working" issues.
Case 1: Secret Name Typo
# ❌ Typo - returns empty string, NO error
- run: echo ${{ secrets.ACR_PASWORD }}
# ✅ Correct name
- run: echo ${{ secrets.ACR_PASSWORD }}
Case 2: Secret Not Available in Fork PRs
This is by design. Pull requests from forks cannot access repository secrets; this prevents malicious forks from stealing your credentials. The secret resolves to an empty string.
Case 3: Secret Exposed in URL
# ❌ DANGEROUS - URL is logged, secret is visible!
- run: git clone https://user:${{ secrets.TOKEN }}@github.com/org/repo.git
# ✅ Safe - use environment variable, masking is preserved
- run: git clone https://user:${TOKEN}@github.com/org/repo.git
env:
TOKEN: ${{ secrets.TOKEN }}
When a secret is interpolated directly into a run: command, the value is injected before the shell sees it. If that value appears in a URL and the URL gets logged (e.g., by git), the masking system may not catch it because the value appears as part of a larger string.
Case 4: Secret Exposed via Base64 / Transformation
# ❌ Encoded value is logged - masking only hides the original
- run: echo ${{ secrets.KEY }} | base64
# ✅ If you must encode, suppress output
- run: echo ${{ secrets.KEY }} | base64 > encoded.txt
Case 5: Environment Secret Not Loaded
# ❌ Environment secrets require the "environment" key
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- run: echo ${{ secrets.PROD_KEY }} # Empty!
# ✅ Declare the environment
jobs:
deploy:
runs-on: ubuntu-latest
environment: production # Now PROD_KEY is available
steps:
- run: echo ${{ secrets.PROD_KEY }}
Case 6: Organization Secret Not Shared
Organization-level secrets must be explicitly shared with specific repositories. Go to Organization Settings → Secrets and variables → Actions → click the secret → check that your repo is in the "Repository access" list.
Secret Debugging Workflow
Use this reusable step to safely verify whether secrets are set without leaking them:
- name: Verify secrets are set
run: |
errors=0
if [ -z "$ACR_PASSWORD" ]; then
echo "::error::ACR_PASSWORD secret is not set!"
errors=$((errors + 1))
else
echo "✅ ACR_PASSWORD is set (length: ${#ACR_PASSWORD})"
fi
if [ -z "$KUBE_CONFIG" ]; then
echo "::error::KUBE_CONFIG secret is not set!"
errors=$((errors + 1))
else
echo "✅ KUBE_CONFIG is set (length: ${#KUBE_CONFIG})"
fi
if [ $errors -gt 0 ]; then
echo "::error::$errors secret(s) missing. Check repository settings."
exit 1
fi
env:
ACR_PASSWORD: ${{ secrets.ACR_PASSWORD }}
KUBE_CONFIG: ${{ secrets.KUBE_CONFIG }}
Why env:? Passing secrets via env: (instead of inline ${{ secrets.X }}) ensures the masking engine always recognizes the value. It also avoids shell injection: a malicious secret value containing ; rm -rf / would be treated as a literal string in an environment variable, not as a shell command.
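A quick local demonstration of that injection point (the secret value here is hypothetical and deliberately hostile, but harmless):

```shell
# A value that reaches the script through a variable is expanded as plain
# data after the shell has already parsed the command - the embedded
# metacharacters are never executed.
SECRET='hunter2; echo INJECTED'          # pretend this came from env:
result="$(printf 'token=%s' "$SECRET")"
echo "$result"
```

Inline `${{ }}` interpolation instead pastes the value into the script text *before* the shell parses it, so the same value there would run `echo INJECTED` as a second command.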
Scenario 5: Docker Build / Push Failures
Symptom: "ERROR: failed to solve", "denied: requested access to the resource is denied", or the build takes 30+ minutes.
Case 1: Dockerfile Not Found
ERROR: failed to solve: failed to read dockerfile: open Dockerfile: no such file or directory
Cause: The context or file parameter in your Docker build action points to the wrong path.
# ❌ Dockerfile is in ./app/ but context is root
- uses: docker/build-push-action@v5
with:
context: .
file: ./Dockerfile # File not in root!
# ✅ Correct path
- uses: docker/build-push-action@v5
with:
context: ./app
file: ./app/Dockerfile
Case 2: Build Fails in CI but Works Locally
Three common causes:
- Missing build args: you have `ARG API_KEY` in the Dockerfile but don't pass `--build-arg` in CI.
- Docker version mismatch: the Ubuntu runner has a different Docker/BuildKit version. Pin with `docker/setup-buildx-action@v3`.
- Platform mismatch: you develop on Apple Silicon (arm64) but CI runs on x86_64 (amd64). Add `platforms: linux/amd64` explicitly.
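A sketch combining those three fixes in one build step (the build-arg name and value are illustrative, not part of the original example):

```yaml
- uses: docker/setup-buildx-action@v3   # pin the BuildKit toolchain CI uses
- uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/amd64              # build the arch CI/production runs on
    build-args: |
      API_KEY=${{ vars.API_KEY }}       # pass the ARG the Dockerfile expects
```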
Case 3: Push Denied
denied: requested access to the resource is denied
Causes:
- Not logged in: add `docker/login-action@v3` before the build step.
- Wrong registry URL: `myacr.azurecr.io` vs `ghcr.io/owner`.
- Expired credentials: rotate the service principal password or PAT.
- Missing `packages: write` permission for GHCR.
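The first and last causes are usually fixed together. For GHCR, a sketch of the login step using the workflow's built-in token:

```yaml
permissions:
  packages: write            # without this, the push is denied even after login

steps:
  - uses: docker/login-action@v3
    with:
      registry: ghcr.io
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}
```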
Case 4: Slow Builds (No Layer Caching)
Without caching, Docker rebuilds every layer from scratch on every CI run. Add GitHub Actions cache:
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: myacr.azurecr.io/myapp:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
This caches Docker layers in GitHub Actions cache. Subsequent builds only rebuild changed layers, cutting build times from 10+ minutes to under 1 minute.
Case 5: Multi-Platform Build Errors
If you need both linux/amd64 and linux/arm64 images, set up QEMU first:
- uses: docker/setup-qemu-action@v3
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
with:
platforms: linux/amd64,linux/arm64
push: true
tags: myacr.azurecr.io/myapp:${{ github.sha }}
Scenario 6: Helm / AKS Deployment Failures
Symptom: "UPGRADE FAILED", "timed out waiting for the condition", "ImagePullBackOff"
Case 1: helm upgrade --install Timeout
Error: UPGRADE FAILED: timed out waiting for the condition
This almost always means the new pods are crash-looping and never become Ready. Debug steps:
# 1. Check pod status
kubectl get pods -n production -l app=myapp

# 2. Check why the pod is failing
kubectl describe pod <pod-name> -n production

# 3. Check application logs
kubectl logs <pod-name> -n production --previous

# 4. Check events for the namespace
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
Case 2: ImagePullBackOff
Warning Failed Back-off pulling image "myacr.azurecr.io/myapp:abc123"
Root causes:
- AKS can't authenticate to ACR: run `az aks check-acr --name myaks --resource-group myrg --acr myacr.azurecr.io`
- ACR not attached to AKS: run `az aks update -n myaks -g myrg --attach-acr myacr`
- Image tag doesn't exist: CI pushed tag `abc123` but the Helm values use `latest`. Verify with `az acr repository show-tags -n myacr --repository myapp`
Case 3: Wrong Helm Values
# ❌ Common mistakes in CI
helm upgrade --install myapp ./charts/myapp \
--set image.tag=${{ github.sha }} # Missing quotes around SHA!
--namespace production # Missing --create-namespace on first deploy
# ✅ Correct
helm upgrade --install myapp ./charts/myapp \
--set image.tag="${{ github.sha }}" \
--namespace production \
--create-namespace \
--wait \
--timeout 5m
Case 4: --atomic Causes Silent Rollback
When --atomic is set, Helm automatically rolls back on failure, but the error output is minimal. To see why it failed, add --debug:
helm upgrade --install myapp ./charts/myapp \
  --atomic \
  --debug \
  --timeout 5m \
  --namespace production 2>&1 | tee helm-output.log
Case 5: Resource Quota Exceeded
Error creating: pods "myapp-xyz" is forbidden: exceeded quota: default-quota, requested: cpu=500m, used: cpu=900m, limited: cpu=1000m
Fix: Either reduce the pod's resource requests in values.yaml, or ask the cluster admin to increase the ResourceQuota for the namespace.
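If you take the first route, the fix lands in the chart's values file. A hypothetical fragment sized against the quota from the error above (900m of the 1000m CPU quota already used):

```yaml
# values.yaml (names illustrative) - requests must fit the remaining quota
resources:
  requests:
    cpu: 100m        # 900m used + 100m requested = 1000m, exactly at the limit
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi
```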
Case 6: Ingress Not Working
The deploy succeeds, pods are Running, but the app isn't accessible externally. Checklist:
- Is the Ingress Controller installed? `kubectl get pods -n ingress-nginx`
- Does the Ingress resource exist? `kubectl get ingress -n production`
- Is cert-manager issuing the TLS certificate? `kubectl describe certificate -n production`
- Does DNS point to the Ingress Controller's external IP? `nslookup myapp.example.com`
Scenario 7: Runner Issues
Symptom: The job sits queued for minutes or hours at "Waiting for a runner to pick up this job…", or you see "No runner matching the specified labels was found."
Case 1: Wrong Runner Label
# ❌ Label doesn't match any runner
runs-on: self-hosted-linux

# ✅ Labels are comma-separated (an array), not hyphenated
runs-on: [self-hosted, linux]
Case 2: Self-Hosted Runner Offline
Check the runner status at Settings → Actions → Runners. If the runner shows "Offline":
- SSH into the runner machine and check the service: `sudo systemctl status actions.runner.*`
- Restart: `sudo systemctl restart actions.runner.*`
- Check logs: `journalctl -u actions.runner.* --since "1 hour ago"`
Case 3: GitHub-Hosted Runner Capacity
During peak times, GitHub-hosted runners may take longer to provision. If jobs are queued for more than 5 minutes:
- Check GitHub Status for incidents.
- Consider using larger runners (GitHub Teams/Enterprise) for priority queuing.
- Add a `timeout-minutes` to prevent jobs from hanging indefinitely.
Case 4: Job Exceeds Maximum Runtime
jobs:
build:
runs-on: ubuntu-latest
timeout-minutes: 15 # Default is 360 (6 hours). Set a sane limit.
steps:
- uses: actions/checkout@v4
- run: npm test
Always set timeout-minutes on your jobs. Without it, a hanging process (like a test waiting for user input) will burn 6 hours of runner time before GitHub kills it. For most CI jobs, 15–30 minutes is a generous limit.
Debugging Decision Tree
When a workflow fails, walk this decision tree from top to bottom to find your issue category fast:
Workflow failed?
│
├─ No run appeared in the Actions tab
│   ├─ File not in .github/workflows/? → Move it
│   ├─ File extension not .yml/.yaml? → Rename it
│   ├─ YAML syntax error? → Validate with yq or VS Code
│   ├─ Branch/path filter doesn't match? → Update trigger filters
│   ├─ Actions disabled on repo/fork? → Enable in Settings
│   └─ Wrong event type? → Check the on: trigger
│
├─ Run appeared, but the job was skipped
│   ├─ if: condition evaluated to false? → Check expression logic
│   └─ needs: dependency job failed? → Fix the upstream job first
│
├─ Job started, but a step failed
│   ├─ Permission error (403)? → Add a permissions: block
│   ├─ Secret is empty? → Check name, scope, fork policy
│   ├─ Docker build/push error? → Check login, Dockerfile, context
│   ├─ Helm/K8s deployment error? → Check kubeconfig, values, image
│   ├─ Test failure? → Check test config, service containers
│   └─ Timeout? → Set timeout-minutes, check for hangs
│
└─ Job succeeded, but the result is wrong
    ├─ Output not passed between jobs? → Check outputs:, needs: syntax
    ├─ Artifact missing? → Check upload/download action versions
    ├─ Wrong environment? → Verify the environment: key
    └─ Cached stale data? → Clear the cache or change the key
Quick Reference: Error Messages
Bookmark this table. It maps the exact error messages you'll see to their most likely cause and quickest fix.
| Error Message | Likely Cause | Quick Fix |
|---|---|---|
| `Invalid workflow file` | YAML syntax error (tabs, missing colons) | Validate with yq or the VS Code extension |
| `Resource not accessible by integration` | `GITHUB_TOKEN` missing write permissions | Add a `permissions:` block to the workflow |
| `HttpError: 403` | Token scope insufficient for the API call | Check the required permission for the specific API |
| `denied: requested access to the resource is denied` | Not logged in to the container registry, or wrong registry URL | Add `docker/login-action@v3` before the push step |
| `No runner matching the specified labels` | Runner label mismatch or self-hosted runner offline | Verify the `runs-on:` label matches available runners |
| `Process completed with exit code 1` | Shell command returned a non-zero exit code | Check the command output above the error line |
| `UPGRADE FAILED: timed out waiting for the condition` | Helm-deployed pods are crash-looping or not becoming Ready | `kubectl describe pod` and `kubectl logs` |
| `ImagePullBackOff` | AKS can't pull the image from the registry | `az aks check-acr` / attach ACR / verify the tag |
| `Error: Process completed with exit code 128` | Git authentication failure (clone, push, fetch) | Check that `GITHUB_TOKEN` or the PAT has repo access |
| `Could not find artifact` | Artifact upload/download name mismatch or expired (90 days) | Verify names match between upload and download steps |
| `The template is not valid` | Expression syntax error in `${{ }}` | Check for unclosed brackets, invalid function names |
| `Unrecognized named-value: 'env'` | Using the `env.` context where it isn't available | `env.` isn't available in `if:` at job level; use `vars.` instead |
| `JsonPayloadError: request entity too large` | Artifact or annotation payload exceeds the size limit | Reduce the payload size; annotations have a 64KB limit |
| `Error: Timeout has been exceeded` | Step or job exceeded `timeout-minutes` | Increase the timeout or optimize the slow step |
| `The action requires a node20 runtime` | Outdated action version with deprecated Node.js | Update the action to its latest version (`@v4`) |
| `Input required and not supplied: token` | The action expects an input that isn't provided | Check the action's README for required `with:` inputs |
Hands-on Labs
Lab 1: Fix the Broken Workflow
Copy this broken workflow into your repo as .github/workflows/lab1.yml and fix the error(s) until it runs green:
# Lab 1: find and fix the bug
name: Lab 1 Broken
on:
push:
branches: [main]
jobs:
greet:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Node
uses: actions/setup-node@v4
with:
node-version: 18
- name: Print Greeting
run: echo "Hello, ${{ github.actor }}"
- name: Generate Report
run: |
echo "report=success" >> $GITHUB_OUTPUT
- name: Read Report
run: echo "Report: ${{ steps.generate.outputs.report }}"
- One step is trying to read an output from another step. How does it know which step?
- The reference must match the `id:` of the producing step.
- Try adding `id: generate` to the "Generate Report" step.
Lab 2: Debug a "Secret Not Found" Scenario
Create a workflow that uses a secret called MY_API_KEY but intentionally misconfigure it. Then fix it step by step:
- Create a repository secret named `MY_API_KEY` with value `sk-test-123`.
- Create this workflow (it has 3 bugs related to secrets):
name: Lab 2 Secret Debug
on: workflow_dispatch
jobs:
use-secret:
runs-on: ubuntu-latest
# Bug 1: Missing "environment: staging" β if secret is environment-scoped
steps:
- name: Print key length
# Bug 2: Using secrets.MY_API_KEYS (typo β extra S)
run: |
if [ -z "${{ secrets.MY_API_KEYS }}" ]; then
echo "Secret is empty!"
else
echo "Secret is set"
fi
- name: Use key in URL
# Bug 3: Secret in URL will be logged
run: curl "https://api.example.com?key=${{ secrets.MY_API_KEY }}"
Fix all 3 bugs: correct the secret name, add environment: if needed, and pass the secret via env: instead of inline.
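One possible fixed version of the secret-using steps, assuming the secret is repository-scoped (add `environment:` to the job if you created it as an environment secret). The header-based curl call is one illustrative way to keep the value out of the command text:

```yaml
- name: Print key length
  env:
    MY_API_KEY: ${{ secrets.MY_API_KEY }}   # Bug 2 fixed: typo removed
  run: |
    if [ -z "$MY_API_KEY" ]; then
      echo "::error::MY_API_KEY secret is not set!"
      exit 1
    fi
    echo "Key length: ${#MY_API_KEY}"
- name: Use key safely
  env:
    API_KEY: ${{ secrets.MY_API_KEY }}      # Bug 3 fixed: no secret inlined into the URL
  run: curl -s -H "Authorization: Bearer $API_KEY" "https://api.example.com"
```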
Lab 3: Fix a Docker Build That Works Locally but Fails in CI
Your Dockerfile uses a build argument for the API URL, but CI doesn't pass it:
# Dockerfile
FROM node:20-alpine
ARG API_URL
ENV API_URL=$API_URL
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build   # Fails because API_URL is undefined during build
# Workflow step β missing build-args
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: myacr.azurecr.io/myapp:${{ github.sha }}
Fix: Add the build-args parameter:
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: myacr.azurecr.io/myapp:${{ github.sha }}
build-args: |
API_URL=https://api.example.com
Lab 4: Diagnose a Helm Deployment Timeout
Your workflow shows UPGRADE FAILED: timed out waiting for the condition. Walk through this debugging sequence in order:
# Step 1: Check what Helm sees
helm list -n production
helm history myapp -n production

# Step 2: Check pod status
kubectl get pods -n production -l app=myapp
# Look for: CrashLoopBackOff, ImagePullBackOff, Pending

# Step 3: If CrashLoopBackOff, check app logs
kubectl logs -l app=myapp -n production --previous --tail=50

# Step 4: If ImagePullBackOff, check image details
kubectl describe pod -l app=myapp -n production | grep -A5 "Events"

# Step 5: If Pending, check resource availability
kubectl describe pod -l app=myapp -n production | grep -A5 "Conditions"
kubectl top nodes

# Step 6: Roll back to the last working version
helm rollback myapp -n production
Bonus: Workflow Annotations
Use these special commands in your run: steps to create annotations that appear directly on the workflow run summary and in pull request checks:
- name: Check code quality
run: |
# Error: creates a red ❌ annotation
echo "::error file=src/app.js,line=42,col=5::Undefined variable 'config'"
# Warning: creates a yellow ⚠️ annotation
echo "::warning file=src/app.js,line=10::Consider using const instead of let"
# Notice: creates a blue ℹ️ annotation
echo "::notice::Build completed in 45 seconds"
# Group β collapses log output into a named section
echo "::group::Test Results"
cat test-results.txt
echo "::endgroup::"
The ::error:: annotation format with file= and line= parameters adds inline annotations to pull request diffs, just like a linter. Use this in custom validation scripts to give developers precise, file-level feedback.
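As a sketch of that pattern, a small script (the directory, marker, and function name are illustrative) that turns grep hits into file-level warnings:

```shell
# Emit a ::warning annotation for every TODO under the given directory;
# when run in a PR workflow, each one appears inline on the matching
# file and line of the diff.
annotate_todos() {
  grep -rn 'TODO' "$1" | while IFS=: read -r file line text; do
    echo "::warning file=${file},line=${line}::Unresolved TODO:${text}"
  done
}
```

The `IFS=:` read splits grep's `file:line:content` output, and everything after the second colon lands in `text`, so colons inside the matched line are preserved.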
Interview Questions
Beginner (Conceptual)
Q: How do you enable debug logging for a GitHub Actions workflow?
A: Three ways: (1) Create a repository secret ACTIONS_STEP_DEBUG set to true for step-level verbose logging, (2) Create ACTIONS_RUNNER_DEBUG set to true for runner-level diagnostics, (3) Re-run a failed workflow and check the "Enable debug logging" checkbox; this enables debug for that single re-run without creating secrets. Debug mode increases log volume significantly but shows internal action details, environment variable resolution, and every command being executed.
Q: If a workflow references a secret that doesn't exist, does the run fail?
A: No, the workflow does not fail. GitHub Actions returns an empty string for any undefined secret. This is a deliberate design choice to avoid leaking information about which secrets exist. The consequence is that a typo like secrets.ACR_PASWORD (missing S) silently returns empty, and your step may succeed but use blank credentials, causing confusing downstream failures. Best practice: add a verification step that checks -z "$SECRET_NAME" and explicitly fails if a required secret is empty.
Q: What is the ::error:: workflow command format used for?
A: It creates annotations on the workflow run. The full format is ::error file={path},line={line},col={col}::{message}. When used with the file parameter, it adds inline annotations to pull request diffs, similar to how linters highlight issues. There are three severity levels: ::error:: (red), ::warning:: (yellow), and ::notice:: (blue). You can also use ::group::Name and ::endgroup:: to create collapsible log sections.
Q: Why don't workflows run on a forked repository by default?
A: GitHub Actions are disabled by default on forks. The fork owner must go to the Actions tab and click "I understand my workflows, go ahead and enable them." Additionally, if they're trying to trigger workflows via pull request to the upstream repo, fork PRs from first-time contributors require manual approval from a maintainer. Scheduled workflows (on: schedule) also only run on the default branch of the original repo, not on forks.
Q: What is the default GITHUB_TOKEN permission since 2023 for new repositories?
A: Read-only for contents and metadata. All other permissions (packages, pull-requests, issues, etc.) default to none. This was changed from the previous default of broad write access as a security hardening measure. Any workflow that needs to write (push packages, comment on PRs, create releases) must explicitly declare permissions using the permissions: key at the workflow or job level.
Intermediate (Technical)
Q: How would you debug an ImagePullBackOff error in a GitHub Actions AKS deployment?
A: Step-by-step: (1) kubectl describe pod <name> and look at the Events section for the exact pull error, (2) Verify the image exists: az acr repository show-tags -n myacr --repository myapp, (3) Check AKS-to-ACR authentication: az aks check-acr --name myaks --resource-group myrg --acr myacr.azurecr.io, (4) If authentication fails, attach ACR: az aks update -n myaks -g myrg --attach-acr myacr, (5) Check if the image tag in Helm values matches what CI actually pushed; a common bug is the SHA tag in the workflow not being passed correctly to --set image.tag.
Q: You're using actions/cache but builds are still slow. What could be wrong?
A: Common causes: (1) The cache key changes every run (e.g., includes a timestamp), so it never hits, (2) The cache path doesn't match where the tool actually stores files; for npm it should be ~/.npm, not node_modules/, (3) The cache was evicted; GitHub limits cache to 10 GB per repo and evicts least-recently-used entries, (4) For Docker builds, you're using actions/cache instead of the native BuildKit cache (cache-from: type=gha), which is more efficient, (5) The restore-keys fallback pattern is too broad, restoring an incompatible old cache that gets discarded during install anyway.
Q: What is the difference between the pull_request and pull_request_target trigger events, and why is one dangerous?
A: pull_request runs workflow code from the PR head branch (the fork's code) but with a read-only token and no access to secrets. pull_request_target runs workflow code from the base branch (your repo's main) but with a write token and full secret access. The danger: if you use pull_request_target and then actions/checkout to check out the PR's code (ref: ${{ github.event.pull_request.head.sha }}), a malicious fork can modify the workflow to exfiltrate secrets. The rule: never check out untrusted code in a pull_request_target workflow.
Q: A step fails but the log output doesn't tell you why. How do you get more detail?
A: (1) Re-run with debug logging enabled for verbose output, (2) Add set -x at the top of the run: block to print every command before execution, (3) Add set -euo pipefail to make the shell fail on the exact line that errors instead of continuing, (4) Check if the command writes to stderr; GitHub sometimes truncates stderr output. Redirect with 2>&1 to merge streams, (5) If using a third-party action, check the action's source code on GitHub; many actions swallow error details in their catch blocks.
Q: Your workflow only triggers on pushes to `main`. How do you test it from a feature branch?
A: Several approaches:
1. Add a `workflow_dispatch` trigger and run it manually from any branch via the Actions tab.
2. Use `nektos/act` to run it locally.
3. Open a PR; `pull_request` events trigger the workflow from the PR branch (if the workflow has that trigger).
4. Temporarily update the `branches:` filter to include your feature branch: `branches: [main, my-feature]`.
5. Use a draft PR if you don't want reviews but need the workflow to run.
Remember that `on: push` with `branches: [main]` will only trigger on pushes to `main`; it won't trigger on your feature branch unless you add it.
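Approaches (1) and (4) are one-line changes to the trigger block; `my-feature` below is a placeholder branch name:

```yaml
on:
  push:
    branches: [main, my-feature]   # (4) temporarily include the feature branch
  workflow_dispatch:               # (1) adds a manual "Run workflow" button on any branch
```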
Scenario-Based (Advanced)
Q: A colleague's fork runs the workflow successfully, but the same workflow never runs in the main repository. What could cause that?
A: This is the reverse of the usual problem (forks are normally the ones with restrictions). Possible causes:
1. Actions are disabled on the main repo. Check Settings → Actions → General.
2. Branch protection rules on the main repo require approval to run workflows.
3. The workflow's `on:` trigger is configured differently in the main repo's branch; perhaps the main branch has an older version of the workflow file.
4. An organization policy restricts which workflows can run. Check the org-level Actions settings.
5. The colleague's fork has a `workflow_dispatch` trigger and they're running it manually, while the main repo only triggers on `push` and nobody is pushing to it.
6. Concurrency control on the main repo is cancelling or queuing the runs behind other deployments.
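For cause (6), check whether the main repo's workflow file carries a `concurrency:` block like this sketch; with `cancel-in-progress: true`, a superseded run is cancelled silently rather than failing:

```yaml
concurrency:
  group: deploy-${{ github.ref }}   # one run at a time per branch
  cancel-in-progress: true          # a newer run cancels this one instead of queuing behind it
```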
Q: The deploy workflow succeeded, but users still see the old version. How do you investigate?
A: Systematic approach:
1. Verify the Helm release: `helm list -n production`. Is the revision number incremented? Check `helm history myapp -n production`.
2. Check the image tag: `kubectl get deployment myapp -n production -o jsonpath='{.spec.template.spec.containers[0].image}'`. Does it show the new SHA?
3. Check rollout status: `kubectl rollout status deployment/myapp -n production`. Did the new pods actually replace the old ones?
4. Check pod age: `kubectl get pods -n production`. Are the pods recently created, or still the old ones?
5. DNS/CDN caching: the deploy may be correct but a CDN or browser is serving cached content. Check with `curl -H 'Cache-Control: no-cache'`.
6. Wrong namespace: the deploy went to staging instead of production, a common `--namespace` mix-up.
7. Ingress routing: the old version is still served because the Ingress hasn't updated. Check `kubectl describe ingress -n production`.
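Checks (1)–(3) can be baked into the workflow as a post-deploy verification step, so a "successful" deploy that changed nothing fails loudly. A sketch, using the same placeholder names as above:

```yaml
- name: Verify the rollout actually replaced the pods
  run: |
    # Recent release history: the revision number should have incremented
    helm history myapp -n production --max 3
    # Blocks until the new pods are ready, or fails the job after 120s
    kubectl rollout status deployment/myapp -n production --timeout=120s
    # Print the image the Deployment is actually running, for the job log
    kubectl get deployment myapp -n production \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```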
Q: In a matrix build, the Node 16 job suddenly started failing while Node 18 and 20 pass. What do you check?
A:
1. Re-run only the failed matrix job with debug logging to get verbose output.
2. Check Node 16's EOL status: Node 16 reached end-of-life in September 2023, and many packages have since dropped support for it. Check whether a dependency updated and dropped Node 16 compatibility.
3. Lock the dependency versions: run `npm ci` (not `npm install`) with exactly the same `package-lock.json` locally on Node 16 to reproduce.
4. Check for missing runtime features: Node 16 lacks some ES2022+ additions such as `structuredClone` (added in Node 17) and `Array.prototype.findLast()` (added in Node 18).
5. Add a diagnostic step before the failing command: `node --version && npm --version && npm ls` to verify the exact environment.
6. Consider dropping Node 16 from the matrix since it's past EOL; this is often the correct fix.
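A matrix sketch combining the diagnostic step from (5) with a graceful path toward (6). `fail-fast: false` keeps the Node 18/20 results visible even while the 16 job dies, which makes the comparison in the logs possible:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false              # let 18 and 20 finish even if 16 fails
      matrix:
        node: [16, 18, 20]          # drop 16 here once you stop supporting it
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - name: Diagnostic snapshot of the environment
        run: node --version && npm --version && (npm ls --depth=0 || true)
      - run: npm ci && npm test
```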
Q: A workflow passes sometimes and fails sometimes with no code changes. How do you debug a flaky workflow?
A: Flaky workflows usually fall into these categories:
1. Race conditions in tests: tests depend on timing, service startup, or execution order. Fix with proper wait/retry logic, and prefer polling over `sleep`.
2. Rate-limited API calls: external APIs return 429 errors under load. Add retry with exponential backoff.
3. Resource contention: tests share resources like ports or databases. Use random ports and isolated test databases.
4. Network instability: package downloads, Docker pulls, or external service calls fail intermittently. Add retry logic, or use caching to avoid the network calls entirely.
5. Runner resource exhaustion: parallel tests consume all available memory/CPU. Reduce parallelism or use a larger runner.
To diagnose: download the logs from multiple runs, diff the failing and passing logs, and look for the first line that diverges.
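The "poll instead of sleep" fix from (1) and the exponential backoff from (2) share one shape: retry a check with growing delays until it succeeds or you run out of attempts. A minimal bash sketch (`wait_for` and `ready` are hypothetical names, not part of any standard tooling):

```shell
#!/usr/bin/env bash
# wait_for CMD [ARGS...] -- retry CMD with exponential backoff (1s, 2s, 4s, ...)
# until it succeeds, or fail after a fixed number of attempts.
wait_for() {
  local attempts=5 delay=1
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0                # the check passed; stop polling
    fi
    sleep "$delay"
    delay=$((delay * 2))      # back off so we don't hammer a struggling service
  done
  return 1                    # still failing after all attempts
}

# Demo: a "service" that only becomes ready on the third check.
tries=0
ready() { tries=$((tries + 1)); [ "$tries" -ge 3 ]; }

wait_for ready && echo "service ready after $tries checks"
```

In a real workflow the check would be something like `curl -fsS localhost:8080/health`, replacing a blind `sleep 30` that fails on slow runners and wastes time on fast ones.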
Q: A secret was accidentally committed to the repository and printed in a workflow log. What do you do?
A: Immediate response:
1. Revoke the secret immediately: rotate the API key, password, or token. Assume it's compromised.
2. Remove it from git history: use `git filter-repo` or BFG Repo-Cleaner (`git filter-branch` is deprecated) to purge the commit, then force-push. Contact GitHub Support to clear cached views.
3. Delete the workflow run log that exposed the value: Actions tab → click the run → gear icon → "Delete all logs".
4. Audit access: check whether the secret was used during the exposure window.
Prevention:
1. Enable GitHub secret scanning (free for public repos, available with GHAS for private).
2. Add `.gitignore` rules for `.env`, `.secrets`, etc.
3. Use pre-commit hooks like `detect-secrets` to block commits containing secrets.
4. Use repository/environment secrets instead of hardcoded values.
5. Add `trufflehog` or `gitleaks` to your CI pipeline to catch leaks before merge.
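For prevention step (5), a minimal scanning job might look like the sketch below. `gitleaks/gitleaks-action` is a real action, but treat the version pin and configuration as assumptions to verify against its README before adopting:

```yaml
jobs:
  scan-for-secrets:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0            # fetch full history so old commits get scanned too
      - uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```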
π Summary
- Debugging toolkit: Enable the `ACTIONS_STEP_DEBUG` and `ACTIONS_RUNNER_DEBUG` secrets, re-run with debug logging, download log archives, and use `nektos/act` for local testing.
- YAML errors: Tabs kill workflows silently. Always use spaces. Validate locally with `yq` or VS Code.
- Trigger issues: If no run appears, check the file path, extension, branch/path filters, and YAML syntax, in that order.
- Permission errors: New repos default to read-only tokens. Always add an explicit `permissions:` block for any workflow that writes.
- Secret gotchas: Typos return empty strings (not errors). Always verify secrets with a check step. Never interpolate secrets into URLs or encode them.
- Docker failures: Check context/file paths, pass build args, log in before pushing, and enable `cache-from: type=gha` for fast builds.
- Helm/AKS issues: `helm upgrade` timeouts usually mean crash-looping pods. Use `kubectl describe pod` and `kubectl logs --previous` to find the root cause.
- Runner problems: Verify that `runs-on:` labels match available runners. Always set `timeout-minutes` to avoid runaway jobs.
- Decision tree: No run → check triggers. Skipped → check conditions. Step failed → check the error category. Succeeded but wrong → check outputs and caching.